Contents lists available atScienceDirect
Big
Data
Research
www.elsevier.com/locate/bdr
Efficient
Indexing
and
Query
Processing
of
Model-View
Sensor
Data
in
the
Cloud
✩
Tian Guo
a,
∗
,
Thanasis
G. Papaioannou
b,
Karl Aberer
aaEPFL, Switzerland
bCenter for Research & Technology Hellas (CERTH), Greece
a
r
t
i
c
l
e
i
n
f
o
a
b
s
t
r
a
c
t
Article history:
Availableonline11July2014 Keywords: Bigdata Index Key-valuestores MapReduce Approximation Queryoptimization
Asthenumberofsensorsthatpervadeourlives increases(e.g.,environmentalsensors,phonesensors,
etc.),the efficientmanagement ofmassiveamountofsensordata isbecomingincreasinglyimportant.
The infinite nature of sensor data poses a serious challenge for query processing even in a cloud
infrastructure. Traditional raw sensor data management systems based on relational databases lack
scalability toaccommodatelarge-scale sensordata efficiently.Thus,distributedkey-valuestores inthe
cloud arebecomingaprimetool tomanagesensordata.Model-viewsensordata management,which
stores the sensor data in the form of modeled segments, brings the additional advantages of data
compression and value interpolation. However,currently thereare notechniquesfor indexing and/or
query optimizationof the model-view sensor data in the cloud; full table scan is neededfor query
processing in the worst case. In thispaper, we propose an innovative index for modeled segments
in key-value stores, namely KVI-index. KVI-index consists of two interval indices on the time and
sensorvaluedimensionsrespectively,eachofwhichhasanin-memorysearchtreeandasecondarylist
materialized inthekey-valuestore.Then,weintroduceaKVI-index–Scan–MapReducehybridapproach
toperformefficientqueryprocessinguponmodeleddatastreams.Asprovedbyaseriesofexperiments
ataprivatecloudinfrastructure,ourapproachoutperformsinquery-responsetimeandindex-updating
efficiencybothHadoop-basedparallelprocessingoftherawsensordataandmultiplealternativeindexing
approachesofmodel-viewdata.
©2014ElsevierInc.All rights reserved.
1. Introduction
Recent advances in sensor technology have enabled the vast deploymentofsensorsembeddedinuserdevicesthatmonitor var-iousphenomenafordifferentapplicationsofinterest,e.g., air/elec-trosmogpollution,radiation,earlyearthquakedetection,soil mois-ture, permafrost melting, etc. The data streams generated by a large numberof sensors are representedastime series inwhich each data point is associated with a time-stamp and a sensor value.Theserawdiscreteobservationsaretakenasthefirstcitizen in traditional relationalsensor data management systems, which leadstoanumberofproblems.Onone hand,inordertoperform analysisof the raw sensordata, users usually adopt other third-partymodelingtools(e.g.,Matlab,R andMathematica),which in-volveof tediousandtime-consumingdata extract,transform and
✩ ThisarticlebelongstoScalableComputingforBigData.
*
Correspondingauthor.E-mail addresses:tian.guo@epfl.ch(T. Guo),thanasis.papaioannou@iti.gr (T.G. Papaioannou),karl.aberer@epfl.ch(K. Aberer).
load (ETL) processes [1]. Moreover, such data analysis tools are usually used for static data set and therefore cannot be applied for onlineprocessingof sensordata streams. Onthe other hand, unboundedsensordatastreamsoftenhavemissingvaluesand un-knownerrors,whichalsoposesgreatchallengesfortraditionalraw sensordatamanagement.
To this end, various model-based sensor data management techniques [1–4] have been proposed. Model-view sensor data management leverage time series approximation and modeling techniquestosegmentthecontinuoussensortime seriesinto dis-joint intervalssuch that each interval can be approximatedby a kind ofmodel, such aspolynomial, linearor SVD. These models, foralltheintervals(orsegmentmodels),exploittheinherent cor-relations (e.g.withtimeoramongdatastreams)inthesegments ofsensortime seriesandapproximateeachsegment byacertain mathematical function within a certain error bound [5–9].Then, one can only materialize the models of the segments instead of therawdataandharvestanumberofbenefit:
First, model-view sensor data achieves compression over raw sensor dataandthereforerequireslessstorage overhead[10–12]. Second, due to the sampling frequency or sensor malfunction,
http://dx.doi.org/10.1016/j.bdr.2014.07.005 2214-5796/©2014ElsevierInc.All rights reserved.
there may be some missing values at some time points. If one queryinvolvessuchtimepoints,thentherelevantsegmentmodel can be used to estimate the values [10–12]. In some degree, model-view sensor data increases the data availability for query processing. Third, there usually exist outliers inraw sensordata, whichhas negative effect on the query results.Model-view sen-sor data removes the outliers in each interval via the segment modelandhistoricaldataonupperandlowerdatabounds,thereby diminishingtheeffectofoutliersinqueryresults.Fourth, regard-ing the similarity search or pattern discovery in sensor time-seriesmining, the segment-based time seriesrepresentation is a powerfultool fordimension reduction andsearch spacepruning
[9,13,14].
However, proposed model-basedsensordata management ap-proaches mostly employ the relational data model and process queriesbased onmaterialized views[2,3] ontop ofthe modeled segmentsofsensordata.Nowadays,theamountofdataproduced bysensorsisexponentiallyincreasing.Moreover,thereal-time pro-ductionofsensordatarequiresthedatamanagementsystemtobe abletohandlehigh-concurrentmodel-viewsensordatafrom mas-sivesensorsandthisisdifficultfortraditionalrelationaldatabase torealize.Tothisend,recentprevalentcloudstoreandcomputing techniquesprovideapromisingwaytomanagemodel-viewsensor data[15–19].
The main focus of this paper is on how to manage the seg-mentmodelsofsensordata,namelymodel-viewsensordatawith thenewlyemergingcloudstoresandcomputingtechniquesrather thanhowtoexploremoreadvancedsensordatasegmentation al-gorithms.
Inour approach, we exploit key-value stores andthe MapRe-duceparallelcomputingparadigm [18,20],twosignificant aspects ofcloudcomputing, torealizeindexingandqueryingmodel-view sensordata inthe cloud.We characterize the modeled segments ofsensortime seriesbythe time andthevalue intervalsofeach segment[16,17].Consequently,inordertoprocessrangeorpoint queries onmodel-view sensor data,our indexin the cloud store shouldexcelinprocessingintervaldata.Currentkey-valuebuilt-in indicesdonotsupportinterval-relatedoperations.Theinterval in-dex formodel-view sensor data should not only work forstatic data, but it should be dynamically updated based on the new arrivingsegments ofsensor data.Iftraditional batch-updatingor periodicalre-buildingstrategy wasapplied here[21,22],then the highspeed of sensor data yielding might lead to a large size of thenewunindexeddata,eveninshorttimeperiodsandto signif-icant indexupdating delay as well. As a result, theperformance ofqueries involvingboth indexedandunindexed datawould de-generategreatly. Therefore, the interval index in the cloud store shouldbeabletoinsertanindividualnewmodeledsegmentinan onlinemanner.
Thecontributionsofthispaperaresummarizedasfollows:
•
Innovative interval index: We propose an innovative interval index for model-view sensor data management in key-value stores, referred toas KVI-index. KVI-indexis a two-tier struc-tureconsistingofonelightweightandmemory-residentbinary searchtreeandoneindex-modeltablematerializedinthe key-value store. This composite index structure can dynamically accommodatenewsensordatasegmentsveryefficiently.•
Hybrid model-view query processing: After exploring the searchoperationsinthein-memorystructureoftheKVI-index forrangeandpointqueries,weintroduceahybridquery pro-cessingapproachthatintegratesrangescanandMapReduceto processthesegmentsinparallel.•
Intersection search: We introduce an enhanced intersection search algorithm (iSearch+) that produces consecutiveresults suitable for MapReduce processing. We theoretically analyzetheefficiencyof(iSearch+)andfindtheboundon the redun-dantindexnodesthatitreturns.
•
Experimentalevaluation:Ourframeworkhasbeenfully imple-mented,includingtheonline sensordatasegmentation, mod-eling, KVI-index andthe hybrid queryprocessing, andit has beenthoroughlyevaluatedagainstasignificantnumberof al-ternative approaches. Asexperimentally shownbasedon real sensor data,our approach significantly outperforms in terms ofqueryresponsetimeandindexupdatingefficiencyallother onesforansweringtime/valuepointandrangequeries. Theremainder ofthispaperisorganizedasfollows:Section 2summarizessome relatedwork on model-viewsensor data man-agement,intervalindexandindex-based MapReduceoptimization approaches.InSection3,we provideabriefdescriptionof sensor-datasegmentation, queryingmodel-view sensordataandthe ne-cessitytodevelop intervalindexformanaging model-viewsensor data in key-value stores. The detailed designs of our innovative KVI-indexandthehybridqueryprocessingapproacharediscussed in Sections 4 and 5 respectively. Then, in Section 6, we present thoroughexperimentalresultstoevaluateourapproachwith tradi-tionalqueryprocessingonesonbothrawsensordataandmodeled datasegments.Finally,inSection7,weconcludeourwork.
2. Relatedwork
Time series segmentation is an important research problem in theareas ofdata approximation, dataindexing anddata min-ing. A lot of work has been devoted to exploit different types ofmodels to approximatethe segments oftime series,such that the pruning and refinement framework can be applied to this segment-representedtimeseriesforthepatternmatchingor sim-ilarity search [9,13].Some other researchers proposed techniques for managing the segment models that approximatesensor time series in relational databases. MauveDB designed a model-based view toabstractunderlyingrawsensordata;itthenusedmodels to project the rawsensor readings onto grids for query process-ing [3]. As opposed to MauveDB, FunctionDB only materializes segment models of raw sensor data [2]. Symbolic operators are designed to process queries using models rather than raw sen-sor data. However, both approaches in [3] and [2] do not take intoaccountapplyingindicestoimprovequeryprocessing. More-over, their proposedapproach focusesonmanaging staticdataset oftimeseriesratherthandynamictimeseries.
Also,eachsegmentoftime seriescouldbecharacterizedbyits timeandvalueintervals.Then, oneshouldconsideremployingan intervalindexforprocessingqueriesovermodel-viewsensordata. Twocommonusedindicesforintervaldataaresegmenttree[21]
andintervaltree[23].Asforsegmenttree,itisessentiallyastatic datastructurethat needsaninitializationprocess toconstruct el-ementary intervalsbased onthe initial dataset [21]. Oncea new intervaloutsideofthedomainofcurrentsegmenttreearrives,the elementary intervals should be rebuilt, which is not suitable for therealtimenatureofsensordatastreams[21].Regardingthe in-tervaltree,individualintervalisnotdividedandreplicatedduring theinsertionphaseasinthesegmenttreeandthereforethe stor-ageoverheadislineartothenumberofintervalstoindex[23–25]. However,itisamemory-orientedstructure.
Some efforts [22,23,26,27] have also been done to external-izethesein-memoryindexdatastructures.The relationalinterval tree(RI-tree)[26]integratesintervaltreeintorelationaltablesand transformsintervalqueriesintoSQLqueriesontables.Thismethod makesefficientuseofbuilt-inB+-treeindicesofRDBMS. Neverthe-less,inthispaper,weaimtodesignanintervalindexstructurefor model-view sensor datathat is compatiblewith key-value stores anddistributedqueryprocessinginthecloud.
In [28], they proposed an approach that enables key-value storesto support queryprocessing ofmulti-dimensional datavia integratingspacefilingorderintorow-keys,whichisnotfitfor in-tervaldata.Anindexformulti-dimensionalpointdataisproposed in[29]basedonP2Poverlaynetwork,whichisadifferent underly-ingarchitecturefromouradoptedkey-valuestore.Thelatesteffort todevelopintervalindicesinthecloudutilizesMapReduceto con-structasegmenttreematerializedinthekey-valuestore[22].This approach outperforms the interval query processing provided by HBase(http://hbase.apache.org/)andHive(http://hive.apache.org/). However, the segment tree utilized in [22] is essentially a static index, as is also the case with [21]. Therefore, an index re-buildingphase needs to be periodically executed to include new data.
MapReduce parallel computing is an effective tool to process largescaleofsensordataincloudstores,butconventional MapRe-duce always conducts trivial full scan on the whole dataset for thequeries ofanyselectivity[19].Inorder toenable MapReduce andindicestocollaborate forqueryprocessing,manyresearchers proposedtointegrateindextechniquesintoMapReduceframework toavoidfulldatascanforlow-selectivequeries[30–34].Oneprior workistoconstructaB+-tree overstaticdatasetinadistributed filesystem(e.g.HDFS)andmaterializetheindexalsoinHDFS[34]. Then, the query processor can process thisindex file, which in-volvesofmultipleMapReducejobstoaccessdifferentlayersofthe treeandthereforeisnotefficientenough[34].Moreover,the pro-posed HDFSbased B+-tree isonly effective for point datarather thanintervaldata[34].
Theauthorsin[31] integrate indicesintoeach filesplitofthe datafile inHDFS, suchthat mappers canuse theindices toonly access predicate-qualified data records in each split. In [32], in-dicesapplicable to querieswith predicatesonmultiple attributes were developed by indexing different record attributesin differ-ent block replicasof data files. These two kinds of index access methods in MapReduce are both on record-level. If we integrate an interval index into each file split guided by this direction, the MapReduce data preparation, starting overhead and Mapper waves are actually not decreased much and therefore the per-formance doesnot significantly improve. On theother hand,the update in one file split requires to rebuiltthe whole local index ofthesplitthereby havinghighlatencyforsegmentinsertions.In
[33],asplit-levelindexisdesignedtodecreaseMapReduce frame-workoverhead.Ascomparedto aboverecord-levelindicesin[31, 32],split-levelindiceseliminateirrelevantfilesplitsbefore launch-ingmappers, andthedata transferringandstarting upoverheads are furtherdecreased.This kindofapproach ismoreefficient for decreasing the map function invocations via overriding the data reader of MapReduce [31,32], as compared to record-level opti-mizations.
3. Overviewofmodel-viewsensordatamanagement
Inthissection,wediscusscertainimportantissuesfor manag-ingmodel-viewsensordataina key-valuestore.First,weexplain whatismodel-viewsensordata.Then,wediscusspossiblestorage schemasformodel-viewsensordata andexplain thenecessityto developaspecialindexforitinkey-valuestores.Last,wedescribe the querytypes of our focusand some particular techniques for processingmodel-viewsensordataqueries.
3.1. Sensortimeseriessegmentation
Sensor time series segmentation is a type of algorithms that fragment a time series stream into disjoint segments, and then approximate each data segment by a mathematical function or model,such that a specific errorbound is satisfied[5–9]. Query
Algorithm 1:Sensortimeseriessegmentation.
Input:vt,t,/*valueandassociatedtimepointofonesensorreading*/
error_bound 1 begin
2 ifcurrentSeg_error(anchor, t) <error_boundthen
3 seg[anchor,t]=segmentModel(vanchor,. . . ,vt);
4 /*approximatethesensorvaluesbetweenthetimerange[anchor,t] withthespecifiedtypeofmodel(e.g.,constant,linear,polynomial, etc.)*/
5 else
6 lt=anchor;
7 lr=t−1;
8 SegMaterialization(seg[anchor,t])
9 /*outputthisnewproducedsegment
seg
[anchor,t]formaterialization */10 anchor=t;
Table 1 Symbollist.
Symbol Semantics
lt Beginning timestamp of one segment model
rt Ending timestamp of one segment model
lv Minimum value of one segment model’s value range
rv Maximum value of one segment model’s value range
pi0. . .pin Coefficient of each item in polynomial model
processing can then be performedon the materialized segments insteadoftherawsensordata,asin[2].Sensordatasegmentation andmodeling algorithmshavebeenextensivelystudiedin[4–9].
Assegmentationalgorithmsarenotthefocusofthispaper,we only presentageneral frameworkin Algorithm 1foronline time series segmentation[7],which isusedin thedatasetpreparation phase ofoursub-sequential experimentalevaluationinSection6.
Table 1summarizesthelistofsymbolsthat weemployformodel segments.
Thissensor timeseriessegmentation programis invokedeach time a new sensorreading comes. It first checks if thesegment modeloverthesensorreadingsfromtheanchortimepointto cur-rent time, namely in range
[
anchor,
t]
, satisfies the error bound. If yes, the current segment model is updated to seg[anchor,t]. If no, the current time t is supposed to be the starting point an-chorofanewsegment.Then,itoutputs thesegmentseg[anchor,t−1] for thematerialization process forwhich our designedindex ap-proach is responsible. The segment seg[anchor,t−1] consists of the following information: time interval[
lt,
rt]
, value range[
lv,
rv]
and model formula (for instance, it could be a sequence of co-efficients for each item of one polynomial function). Under this framework,userscanchoosedifferentcategoriesofmodelsto ap-proximatethesensordatainsegments.Forinstance,atimeseries shownin Fig. 1(a) istransformed intoa sequenceof linear func-tions with time dimension astheindependent variable,which is depictedinFig. 1(b).The associatedvaluerangeinFig. 1(b) indi-cates the values one segment model can cover within the time interval. Such set of functions, each with its associated[
lt,
rt]
and[
lv,
rv]
for the sensor data segments are referred to as the model-viewsensordataovertheoriginalrawsensordata observa-tions.3.2. Storagemodelinkey-valuestores 3.2.1. HBase overview
A table in HBase consists of a set of regions, each of which storesasequentialrangepartitionoftherow-keyspaceofthe ta-ble [18,20]. An HBase cluster consists of a collection of servers, calledregionservers,whichareresponsibleforservingasubsetof regions ofone table.The key-baseddataaccessmethodofHBase isrealizedbyathreelayeredB+treemaintainedamongthenodes
Fig. 1.Model-view sensor data.
Fig. 2.Model-view sensor data in HBase key value store.
ofHBasecluster.ThetwoupperlevelsreferredtoastheROOTand METAregionsarekeptinthemasternodeofan HBasecluster.The ROOTregionplays theroleofrootnodeofoneB+tree,whilethe METAtablemaintainsamappingoftheregionsofatabletoregion servers.Theregionsdistributedamongtheothernodesconstitute thelowestlevel oftheB+-tree indexandstorethereal dataofa table.Ifaregion’ssizeexceedsaconfigurablelimit,HBasecan dy-namicallysplit itintotwo sub-regions,whichenablesthesystem togrowdynamicallyasdataisappended.RegardingaMapReduce job ona table inHBase, thenumber ofmappers isequal by de-faulttothat ofregions ofthetable toprocess,whichmeans that eachmapperprocesses oneregion, asshowninFig. 2.The num-berofreducersisconfigurableprogrammatically.AndHBaseallows MapReduce to only process key-based table partitions of inter-est,which can avoid trivialfull-scan ona table forlow-selective queries[18,20].
3.2.2. Interval-indexinHBase
Forthemodel-viewsensordatainthekey-valuestore,thetime interval,the value range andthemodel formula(e.g. polynomial coefficients) can fully approximate one data segment. There are differentpossiblewaystoorganizerowkeysandcolumnsfor stor-ingandqueryingmodel-viewsensordata.
Oneidea forstoring segments inkey-value storescould be to do it similarly to that of the raw sensor data management sys-temsuchasOpen-TSDB[15],whichtakesthetimeintervalofone segmentasthe rowkey(rk). As rowsare sortedonthe rowkey in the key-value store, the starting points of time intervals are in ascending order. Therefore, although for time range or point queriesthequery processorknowswhento stopthe scan,it still needstostartthescanfromthebeginningofthetableeachtime. The same problem happens to the table with value intervals as row-keys.Forinstance,inFig. 2,whenthevaluerangeisthe row-key, the table is sorted according to the left boundaries of the value rangesof model-viewsensor data.Fora value queryrange
[
9.
2,
10]
,we canmake sureafterthefourthsegment, namely S2 withvalue range[
9.
9,
10.
9]
, thereis noqualified segmentinthe table. However, still scanning has to start from thebeginning of thetable,sincetherightboundariesofthevaluerangesarenotin order.Insummary,simplyincorporatingtime,valueintervalormodel coefficientsintotherow-keycannotcontribute toaccelerating the query processing. A generic indexin key-value stores specialized for model-view sensor data is imperative. Moreover, considering thatmodel-viewsensordataisproducedinreal-time,theindexfor model-viewsensordatainkey-valuestoresshouldbeableto up-dateontheflyefficiently[20].Sinceeachsegmentofmodel-view sensordataischaracterizedby itstimeandvalueranges,we will concentrate on employing interval indexto index both the time andvaluedimensionofonesegment.
However, another issue for designing the interval index for model-view sensor data in key-value stores is where to materi-alizetheindex. One possiblewaywouldbeto storethe indexin separated files in the distributed file system (e.g. HDFS, etc.)on top of which the key-value store HBase isalso built. In this ap-proach, as in the work [34], each layer of the tree structure is storedin one file.Consequently, indexsearchingneeds to invoke multiple MapReduce jobs to iteratively process each level of the tree,whichisnotefficient.Anotherwayistointegrate oneindex for each region to only index the local model-view sensor data in that region, such that mappers can first loadthe indexto lo-catedesireddatainthat regionandthenaccessit.However, this kindofapproachdoesnotdecreasedata preparationandstarting overheadofMapReduceforthewhole table,sincestill all the re-gions of the table are initialized to be processed by MapReduce
[31,32].Ourproposedkey-valuebasedintervalindex(describedin thenext section)steps outofaboveideas andtakesadvantage of boththein-memoryandkey-valuepartstoimprovethequery pro-cessing.
Fig. 3.Two-tier structure of KVI-index.
3.3. Queryingmodel-viewsensordata
Muchworkhasbeendevotedtopatterndiscoveryfortime se-riesdata,butinthispaper, wefocuson thefollowingfour types offundamentalqueriesonmodel-viewsensordata.
•
Time pointquery:returnthevalueofonesensorataspecific timepoint.•
Value pointquery: returnthetimestamps whenthe value of onesensorisequaltothequeryvalue.Theremaybemultiple timestampsofwhichsensorvaluessatisfythequeryvalue.•
Time rangequery:returnthevaluesofonesensorduringthe querytimerange.•
Value range query: return the time intervals of which the sensor values are within the query value range. There may be multipletime intervalsof whichsensor valuessatisfy the queryvaluerange.The generic process to querymodel-view sensordata queries comprisesthefollowingtwosteps:
•
Searching of qualified segments: The qualified segments are thesegments whosetime (resp.value)intervalsintersectthe querytime(resp.value)rangeorpoint.Thisstepshouldmake use of an interval index to locate all the qualified modeled segmentsinthesegmentmodelstore.•
Modelgridding:Qualifiedsegmentsare tooabstractanda fi-nite set of data points are more useful as query results for end-users[2].Therefore,modelgridding isanothernecessary process[2,3].Modelgriddingappliesthreeoperationstoeach qualified segment: (i) It discretizes the time interval of the segment at a specific granularity to generate a set of time points.(ii) Itgeneratesthesensorvaluesatthe discretetime points based on the model that approximates the segment. (iii) Itfiltersoutthesensordatathatdoesnotsatisfythequery predicates.Thequalifiedtimeorvaluepointsthatresultfrom griddingallqualifiedmodeledsegmentsarereturnedasquery results.AsshowninFig. 1(c),segmentone S1 isfoundasonequalified segmentfortimequeryrange
[
10,
70]
,model-griddingdiscretizes thetimeintervalofS1 andevaluatesthevaluesateachtimepoint. Then,thequalifiedtime-valuepairsconstitutethequeryresults.4. Key-valueintervalindex
Wewillfirstpresentthedesignofthetwo-tiermodelindexon key-valuestoresandthenwe willdiscussthe updatingalgorithm ofthemodelindex.
4.1. StructureofKVI-index
Aseachsegmentofmodel-viewsensordatahasaspecifictime and value range, instead of indexing the mathematical functions of segments, our idea isto take the time andvalue intervalsas keys to index each segment, which allows the index to directly servethequeriesproposedinSection3.Weproposethekey-value represented interval index (KVI-index) to index time and value intervals of model-view sensor data. For a segment in Fig. 3(c), the timeandvalue intervalsarerespectivelyindexed bythe KVI-index, asdepictedinFigs. 3(a)and(b).OurKVI-indexadoptsthe idea from interval tree,since in-memoryinterval tree’s primary– secondary structure is convenient for externalization to the key-value store [26]. Furthermore, the searching and scanning algo-rithmoftheintervaltreeissuitablefortheMapReducecomputing paradigm.
Our KVI-index is a novel in-memory and key-value compos-ite index structure. The virtual searching tree (vs-tree) residesin memory,whileanindex-modeltable inthekey-valuestore is de-vised tomaterialize thesecondary structure (SS) ofeach node in
vs-tree.
4.1.1. In-memorystructure
Thein-memoryvirtualsearchingtree(vs-tree)isastandard bi-narysearchtreeshowninFig. 3(a).Eachtime(orvalue)intervalis registeredononlyonenodeofvs-tree:theonewithwhichthe in-terval firstoverlapsalongthesearchingpathfromroot. Thisnode is definedasthe registrationnode
τ
forthisinterval. Eachnode ofvs-tree hasanassociatedsecondary structure(SS), materialized inthekey-valuestore,whichstoresthesubstantialinformationof themodeledsegmentsregisteredatthisnode.Weapplyspace-partitionstrategyforvs-tree.Theheightofthe
vs-tree is denoted by h. We set the value of leftmost leaf node as 0. For negative sensor data values, we use simple shifting to have the data range start from0 for convenience. Then, the do-mainofthevs-tree is
[
0,
R]
and R=
2(h+1)−
2.Thevalue ofrootnode is r
=
2h−
1. During the whole life of KVI-index, only the root value r is kept, since, due to the lightweight computability of the space-partition, the value of each node in the searching path from the root to the node that has the queried point or interval can be calculated in run time. All the operations on vs-tree areperformedinmemoryandare thusvery efficient.As the domains oftime andvalue ofthe sensor data are different, twovs-trees one for times andanother forvalues are kept in mem-ory simultaneously foranswering time andvalue queries respec-tively.
4.1.2. Index-modeltable
Ournovelindex-model compositestorageschemaenables one table notonlytostorethemodeledsegments,butalsoto
materi-Algorithm 2:Time(orvalue)KVI-indexupdating.
Input:[lv,rv],[lt,rt],/*valueandtimeintervalsofonemodeledsegment
Mi,
r
/*M
idenotesthecoefficientsofthemodeledsegment1 begin
2 /*dynamicdomainexpansion
3 if (rt>R)then
4 r=2log(rt+2)−1−1 /*expandtonewrootvalue
5 /*registrationnodesearch
6 node=2logrt+1−1;
7 h =log(node)−1;
8 while(h ≥0)do
9 if(lt≤node and rt≥node)then
10 break;/*nodeistheregistrationnode
11 else 12 if(lt>node)then 13 node=node+2h; 14 if(rt<node)then 15 node=node−2h; 16 h =h −1;
17 /*materializedintotheindex-modeltable.
18 ifthe SS of node has been initializedthen
19 rowkey= node|lt|rt;
20 else
21 rowkey= node|α;
22 insert[lv,rv],[lt,rt],
M
iintothetable.alizethestructuralinformationofthevs-tree,i.e.,theSSsforeach treenode.
The index-model table is shownin Fig. 3(b). Eachrow corre-spondstoonlyonemodeledsegmentofsensordata,e.g.,thedata segment shownin the blackdotted rectangle in Fig. 3(c). A row key consists of the node value and the interval of an indexed modeled segment at that node. The time interval, value interval andthe model coefficientsare all storedin different columnsof the same row. And the SSs of each node correspond to a con-secutive range of tuples in the index-model table. For instance, the rows corresponding to the SSs of node 1, node 5 and node 13 invs-treeareillustrated inFig. 3(b).Analogously,wehavetwo index-modeltablesthatcorrespondtotime andvalue vs-trees re-spectively.
4.2.KVI-indexupdates
The complete modeled segment updating algorithm of KVI-indexisshowninAlgorithm 2.Itincludestwoprocesses:
(1)Registration nodesearching: locate the node
τ
at which one time(value) interval[
lt,
rt]
shouldberegistered.(2)Materialization of modeled segments: construct the row-key based on the
τ
and materialize one modeled segment’s in-formationintothecolumnsofthecorrespondingrow.4.2.1. Registrationnodesearching(rSearch)
Thisalgorithmfirst involvesa domain-expansionprocess to dy-namicallyadjustthedomainofthevs-treeaccordingtothespecific domain of the sensor data. Then, the registration node can be foundonthevalidatedvs-tree.
Lemma 1. For a modeled segment Mi with time (value) interval
[
lt,
rt]
(resp.[
lv,
rv]
), its registration node lies in a tree rooted at 2log(rt+2)−1−
1.Proof.The domain of one tree rooted at 2log(rt+2)−1
−
1 is[
0,
2log(rt+2)−
2]
.As 2log(rt+2)−
2≥
2log(rt+2)−
2=
rt,the reg-istration node of interval
[
lt,
rt]
must be in a tree rooted at 2log(rt+2)−1−
1.2
Fig. 4.Registration node searching of KVI-index.
Lemma2.ForamodeledsegmentMiwithtime(value)interval
(
lt,
rt)
,iftherightend-pointrtsatisfiesrt
>
R,thedomainofvs-treeneedsto expand.Proof. The current vs-tree’s height is h and lt
≤
r, the interval[
lt,
rt]
rides over the root node r. Assume that we do not ex-pand the domain and hang[
lt,
rt]
on r. When a new model Mj with a time (or value) interval[
lt,
rt]
(or[
lv,
rv]
) and lt>
R comes, the root value has to increase to r=
2logrt+2−1−
1(resp. r
= · · ·
) as[
lt,
rt]
(resp.[
lv,
rv]
) intersects no node of cur-rentvs-tree.Then betweenr andr,there isonenodewithvalue 2(h+1)−
1. The interval(
lt
,
rt)
is stored at node r=
2h−
1 and rt>
2(h+1)−
2⇒
rt≥
2(h+1)−
1, therefore registering the in-terval[
lt,
rt]
on node r contradicts withthe interval registration rule.2
UsingLemmas 1 and2,KVI-indexisabletodynamicallydecide whenandhowtoadjustthedomain
[
0,
R]
ofvs-tree.ThecompleterSearchcanbeillustratedbyFig. 4.Whenmodel1istobeinserted, the vs-tree rooted at node7 is still valid. model1 is registered at
node5.However,whenmodel2arrives,its rightend-point,i.e., 16, exceeds thedomain
[
0,
14]
.The vs-treeisexpandedhaving 15 as newroot andthe newextended domain isthe area enclosed by the dotted block in Fig. 4. Then, the sub-sequential model2 andmodel3canbeupdatedsuccessfully.
The rSearch process can be further optimized via adaptively adjusting the starting point based on the interval to index. In this way, rSearch does not need to always search from root node, thereby shortening the length of searching path.Based on
Lemma 1, forone interval
(
lt,
rt)
,we can startto search for reg-istrationnodefromnoder0=
2logtr+1−
1 ratherthanrootnode r,whichcanbereferredtoLine7 inAlgorithm 2.Also,thenodes outsideofsub-domain[
0,
2logtr+1+1−
2]
ofvs-tree will not be-come the registrationnode, becausethe interval(
tl,
tr)
doesnot intersectwiththem.Forexample,themodel4inFig. 4.The adap-tivesearchingfindsthat thesubtreerootedatnode 7 isthe min-imumone covering model4’sinterval[
9,
14]
.Therefore,node 7 is thestartingpointforregistrationnodesearchingandthelengthof searchingpathisonly1.4.2.2. Materializationofmodeledsegment
WhenmaterializingmodelMiintotheSSofanode
τ
,the row-keymaybechosenintwoways:•
UponinitializationoftheSSofnodeτ
:whennomodeled seg-menthasbeenstoredatτ
’sSS,therowkeyischosenasτ
,
α
formodel Mi.Here,
α
isapostfixofrowkeytoindicatethat thisrowisthestartingpositionofτ
’sSSinthetable.•
UponupdatingtheSSofnodeτ
:whentheSSofτ
hasalready beeninitialized,thetimeinterval[
lt,
rt]
(resp.[
lv,
rv]
forvalue interval)tobeindexedwillbeincorporatedintotherowkey, i.e.,τ
,
lt,
rt(resp.τ
,
lv,
rvintheindex-modeltablefor val-ues). In this way, different modeled segments stored in the sameSSofanodedonotoverwriteeachother.Fig. 5.Modeled segment materialization of KVI-index.
The selection ofspecific
α
should make surethat the binary representation ofτ
,
α
is in front of any otherτ
,
lt,
rt. This designisusefulforqueryprocessing.Forinstance,ifthequery pro-cessorrequirestoaccessallthemodeledsegmentsstoredat regis-trationnode5,thenweknowthatallthecorrespondingmodeled segmentslieintherowswithintherow-keyrange[
5,
α
,
6,
α
)
. Forexample,take themodel1andmodel2inFig. 5.First,the KVI-indexcheckswhetherthestartingmodeledsegmentsofnode5andnode15,namelyrowswithkey 5
,
α
and 15,
α
,exist.Then,the rowkey 5,
4,
6isconstructedformodel1astheSSofnode5has been initialized, whilst KVI-index constructs the rowkey 15,
α
formodel2.
4.2.3. Complexityanalysis
In this section, we analyze the computational and communi-cationcomplexitiesofupdating operationofKVI-index.The com-plexity analysis of KVI-index includes in-memory and key-value part.
•
In-memoryThe rSearch on vs-tree can run within O
(
log(
R))
time. The spaceexpansion costs O(
1)
time.Onlytherootvaluer of vs-tree iskept inmemory. The valuesofother nodes onvs-treecanbe calculateddueto thecomputabilityofvs-tree’s space-partition.Therefore,thespacecomplexityofvs-treeis O
(
1)
.•
Inkey-valuestoreThe update operation on vs-tree does not generate any net-work I/O cost. For an n-th order polynomial model of one sensordatasegment,
(
2+
n)
putoperationsareconductedto materializeonemodel.Therefore,thetimecomplexityis O(
1)
intermsofnetworkI/O, asn isconstant.Moreover, one seg-mentmodel’stime(orvalue) intervalandcoefficientsareonly materializedonce intothe index-model table,thus the space costis O
(
N)
fortime(orvalue)index.5. QueryprocessingviaKVI-indexandMapReduce
Forquerying model-viewsensordata,thesearchingprocess of qualifiedmodeledsegments(definedinSection3)inKVI-index in-cludestwosteps:
•
Intersectionandpointsearch:Theintersectionsearchonvs-treeisusedforrange queries,whilepoint search isemployed for point queries. They are responsible for collecting the nodes that accommodate qualified modeled segments in their sec-ondarystructuresSSs.
•
Modeled-segmentfiltering:Duetotheruleforinterval registra-tionatthenodesofthevs-tree,theSSofanodemaycontain some intervals irrelevant to queried range or point. In KVI-index,the SSs ofall nodesfoundby thesearch operationare accessedtofilteroutunqualifiedsegments.Algorithm 3:iSearch+ofvs-tree.
Input: timequeryrange[lt,rt],rootvalue
r
Output: nodesetS0andD1 begin
2 /*constructS0
3 node=r;
h
=log(r)−1;4 while(h ≥0)do
5 if(lt≤node and rt≥node)then
6 break;/*nodeistheregistrationnode
7 else 8 S0=S0∪node 9 if(lt>node)then 10 node=node+2h; 11 if(rt<node)then 12 node=node−2h; 13 h =h −1; 14 /*constructD. 15 ls=2log(lt),
r
s=R −2log(R−rt),D= [ls,rs]Afterthe above two steps, modelgridding component fetches the coefficientsof each qualifiedmodeled segment andperforms gridding. Next, we first describe an enhanced intersectionsearch algorithmonvs-treethatbenefitsKVI–Scan–MapReducequery pro-cessing,introducedlaterinthissection.Wethenpresentthepoint search algorithmonvs-tree.Subsequently,weintroduceournovel hybrid KVI–Scan–MapReduce queryprocessing. Last,we theoreti-cally analyzethe enhanced intersectionsearch algorithm of KVI-index.
5.1. Intersectionandpointsearch
5.1.1. Enhancedintervalintersectionsearch
Algorithm 3 presents the iSearch+. Given a time (resp. value) rangequery
[
lt,
rt]
,iSearch+ firstcallstherSearch tofindthe reg-istration nodeτ
of[
lt,
rt]
.The nodesonthe searchingpathfrom the rootnode to the one precedingτ
form a node set denoted byS
0.TheiSearch+stopsatthenode,whichisclosesttothe left-endpoint lt.Allthenodesalong theleft-descendingpath forma node set,denotedbyS
l,whilethenode withtheminimumvalue inthispathisdenotedbyls.Analogously,S
r isthenode setfrom the right-descendingpath andrs isthe node withthemaximum value inthispath.Anynode outsidetherange[
ls,
rs]
andthesetS
0doesnothaveanyqualifiedmodeledsegments.The traditionalintersection search would returnthe node set
C
=
S
0∪
S
l∪
S
r∪ [
lt,
rt]
for further modeled-segment filtering and gridding. OuriSearch+ outputs thediscrete node setS
0 and a consecutive range of nodesD
= [
ls,
rs]
. For example, take the range query in Fig. 6(a).node7 is the registration node of query range[
6,
10]
.ThetraditionaliSearchreturnsthediscretenodesets shown inthe solid boxes ofFig. 6(a),while our iSearch+ returns a range of nodes[
3,
11]
andS
0= {
15}
. We will see how the output of iSearch+ benefits the hybrid query processing later in Section5.2.5.1.2. Pointsearch
We denotethe point search by sSearch as it functionsas the stabbing search of interval tree. The sSearch is a binary search thatrecordsthenodesalongthedescendingpath.Wepresentthe
sSearch inFig. 6(a).Forexample,whenqueryingthesensorvalue of time point 24,the node set
S
0= {
15,
7,
11,
9,
10}
is returned by sSearch. Since thereis no split searching,as in iSearch+, only one node set is produced here. We denote this node set byS
0 aswell,soastofacilitatethedescriptionofthehybridKVI–Scan– MapReducequeryprocessingthatfollowsnext.Fig. 6.Workflow of KVI–Scan–MapReduce approach. (a)iSearch+andsSearch. (b)SSlocation distribution for the range query. (c) Hybrid processing.
5.2.KVI–Scan–MapReducequeryprocessing
Theconventionalclusteredindexforone-dimensionaldatacan exactly locate a consecutive range of qualified data. Then, the query processor just needs to do range scan on these qualified data.However,asforquerying model-viewsensordata,the chal-lengeis howtotacklethelarge amountofsegmentmodelsfrom the SSs distributed across the index-model table. We first ana-lyze the location distribution of the SSs of the nodes found by
iSearch+ andsSearch intheindex-modeltable. Thecharacteristics ofthis distribution inspired us to propose the hybrid KVI–Scan– MapReducequeryprocessingapproach.
5.2.1. SSlocationdistribution
TherearetwocasesfortheSSdistributionintheindex-model table,describedbelow.
•
S
0:TheSSsofS
0arenon-consecutiveandsparselydistributed in theindex-modeltable. Thenode valueisthe primarypart oftherow-key;thus,thedistancebetweenSSsofS
0 depends on the numerical difference of node values. AsS
0 includes the nodes from root node to the one preceding theτ
, the intra-distancesbetweenanyconsecutivenodesinS
0 are2h−i, where i=
0,
. . . ,
h−
dτ is the position of the node in thedescending search path
S
0 and dτ is the depth ofτ
.Obvi-ously,theintra-distancesin
S
0aregreaterthanthoseforother nodesbelowτ
inthesearchpath.•
D
:TheSSsof[
ls,
rs]
areclusteredaroundtheonesof[
lt,
rt]
in theindex-modeltable.TheSSsof[
lt,
rt]
arealladjacentinthe index-model table.TheSSs of[
ls,
rs]
areboundedbythoseof thesub-treerootedatτ
andthenodesin[
ls,
rs]
areasuperset of thenodes in[
lt,
rt]
.The deeper theregistrationnodeτ
is located, thetighter the setofthe SSsof[
ls,
rs]
over thoseof[
lt,
rt]
.For example, take the time (or value) query range
[
6,
10]
inFig. 6(a). The registration node is node7. Then,
S
0= {
15}
andD
= [
3,
11]
. The sub-treerooted at node7 coversthe node rangeE
= [
0,
14]
andD
⊂
E
.From Fig. 6(b),theSSsofD
are clustered around those of[
6,
10]
and bounded by the SSs ofE
. However,node15’sSSislocatedfarawayfromthoseof
[
3,
11]
.IfSSs of
S
0 andD
are processed via straightforward random access and range scan provided by key-value stores, the entire modeled-segment filtering and gridding processes are conducted locallyattheapplicationside.ForatableofmultipleorhundredsofGBs,thecommunicationandcomputationcostsareprohibitively highfortheapplicationsideevenforlow-selectivequeries.
The modeled segment filtering–gridding processing matches MapReduce’s filtering-aggregation paradigm. Considering the re-search results from[31–33], for CPUnon-intensive workload,I/O cost, network latency and starting-up overhead of mappers are dominant in the execution time of MapReduce programs. If the
SSs of
S
0 andD
are all processed by MapReduce, a lotof time is wasted for mappers that process irrelevant SSs in the index-modeltable.ThisisbecauseMapReducewillaccessthecontinuous regions of the table including the SSs of nodes between theS
0 andD
duetothesequentialdatafeedingmechanisminthe map-ping phase. For example,in Fig. 6(b), theSSs ofD
= [
3,
11]
andS
0= {
15}
are distantinthe table.Hence, MapReducewill launch manyunnecessarymappersfortheirrelevantSSsofnodesbetween 11 and15,inordertoprocesstheSSsofS
0andD
.5.2.2. Hybridmodelfilteringandgridding
As discussed above, simply using range scan or MapReduce to process SSs are both problematic. Ouridea isto design a hy-bridKVI–Scan–MapReduceparadigmthatcombinesrangescanand MapReduceforprocessingSSs,asfollows:
•
(1)S
0: the height ofvs-tree is bounded by log(
R)
,andthus theamountofcomputationonS
0 islimited.AstheSSs ofS
0 aresparselydistributedintheindex-model tableandeachSSof
S
0canbeconsideredtobeasmallrangeofclusteredindex, therandom-access- andrange-scan-basedmodelfilteringand griddingissuitable.•
(2) D= [
ls,
rs]
: the successive range[
ls,
rs]
delimits a tight boundary ofthesub-index-model table overthe relevantSSs thataresuitableforprocessingwithMapReduce.This hybridparadigm eliminates the Map-phaseprocessing of
SSsofirrelevantnodesbetween
S
0andD
andthenodesbetween theelementsofS
0.Moreover,itisnon-intrusiveforboththe key-value store andMapReduce. Regarding the time (or value)point query, it only produces the node setS
0 withoutD
, hence, only range-scan-basedmodelfilteringandgriddingisneeded.Supposethatthenumberofreducersis P andeachreducer is denotedby 0
,
. . . ,
P−
1.Forrangequeries,the partitionfunctionf isusedtoassignthe qualifiedmodeledsegments intodifferent reducers. It is designed on the basis of query time (resp. value) range
[
lt,
rt]
(resp.[
lv,
rv]
) andthetime (or value)interval[
li,
ri]
ofeachmodeled segmenti.The ideaisthateach ofthereducersisincharge ofoneevensub-range rt−lt
P .Suchapartitionfunction f isgiveninEq.(1). f
(
ri)
=
lt≤
ri≤
rt (ri−lt)∗P rt−lt ri≥
rt P−
1 (1) Thefunctionalitiesofmappersandreducersaredepictedin de-tailbelow.•
Mapper: Each mapper gets the time (resp. value) interval[
li,
ri]
of one modeled segment i to check whether it inter-sectswiththequerytime(resp.value)range.Forthequalified modeledsegments,theintermediatekeyisderivedbythe par-tition function f(
ri)
. The model coefficients p1i,
. . . ,
pni are thevaluepartoftheintermediatekey-valuepair.•
Reducer: One reducer receives a list of qualified modeledsegments p1
0
,
. . . ,
pn0,
p11,
. . . ,
pn1,
. . .
.Foreachmodeled seg-ment p1i
,
. . . ,
p ni
,thereducerinvokesamodel-basedgridding functiontocomputediscretevaluesasqueryresults.Regarding the scan-based modelfiltering and gridding, asSSs in
S
0arelocatedindifferentregionsoftheindex-modeltable,the query processor makes use of thread pool to process each SS ofS
0 inparallel.Fig. 6showstheworkflow ofthehybridKVI–Scan– MapReduce approach. For a time (or value) range query[
6,
10]
,iSearch+constructsthenode set
S
0= {
15}
andD
= [
3,
11]
.Then, theSSsofthenodesinD
aresenttoMapReduce.TheSSofnode15, enclosed by the bottomdot-dashed block, is processed viarange scan.5.3. Theoreticalanalysis
One point to carefullyconsider isthat iSearch+ may generate redundant nodes, because the iSearch+ aims to find a tight and consecutiverangeofSSsforMapReduce.Forinstance,inFig. 6(a), theSSofnode4 isnotaccessedbytheiSearch+,butisinthe sub-tableprocessedbyMapReduce.
Theorem1.Forarangequery
[
lt,
rt]
,theredundantnodesin[
ls,
rs]
returnedbyiSearch+arebounded.Proof. Assumehtobetheheightoftheregistrationnode
τ
.Con-sidertheleft-descendingpathfrom
τ
tothenode l0 closest tolt. Letd be the depth from which the descending path turns right, namelythevalue w<
lt ofthe currentnode.Then, basedonthe iSearch+algorithm, wistheleftboundaryofaccessednoderange andw=
τ
−
di=12h−i.
Fora certainvalueofd,theworstcasehappenswhenthe de-scending process continues to go right until reaching l0, as the nodesbetween w andl0 are all redundantones. The number of nodesreturnedbyiSearch+underthiscaseisgivenby:
U
=
τ
−
τ
−
d i=1 2h−i (2) The nodesbetweenτ
and w are allincluded into theoutput rangeD
of iSearch+.The numberof nodesreturned by the con-ventionaliSearchisgivenby:V
=
h−
d+
τ
−
τ
−
d i=1 2h−i+
h i=d+1 2h−i (3)Therefore,thenumberofredundantnodesreturnedbyiSearch+
isgivenby: f
=
U−
V=
d+
h i=d+1 2h−i−
h=
d+
2h−d−
h−
1 (4)Eq. (4) is a function of d and is monotonous decreasing in d’s domain
[
1,
h]
. Consequently,when d=
1, the function f reaches themaximumvalue,namely,thenumberofredundantnodesfromiSearch+attainsthemaximumvalue fmax thatisgivenby:
fmax
=
2h−1−
h.
(5) Asthetotalnumberη
ofnodesofthesub-treeoftheleftchild ofτ
is2h−
1,hencef
≤
12
η
−
log(
η
+
1)
+
12
.
(6)Insummary,thetotalnumberofredundantnodesintherange
[
ls,
rs]
isbounded.2
The worst case happens when the endpoints lt and rt are the preceding andsucceeding nodesof
τ
,namelylt=
τ
−
1 and rt=
τ
+
1.However, formostof thecases, the redundantnodes returnedfromiSearch+areverylimited.6. Experimentalevaluation
First, we compare model-view sensor data query processing with conventional one over raw sensor data. Then, we show that ourKVI–Scan–MapReduce(KSM) approachoutperformsother model-view sensor data querying approaches. Finally, we experi-mentally explore the factors that affect theperformance ofKVI– Scan–MapReduce.
6.1. Setup
We employ accelerometer datafrommobile phones assensor dataset.Thesizeofrawsensordatais22GBincluding200million data points. Aftermodeling,themodeled segments ofthe sensor datatake 12GB,whilethereare around25million modeled seg-ments.
We developed our system using the versions of HBase and HadoopinClouderaCDH4.3.0.The experimentsareperformedon our own cluster that consists of 1 master node and 8 slaves. The master node has 64 GB RAM, 3 TB disk space (4
×
1 TB disks in RAID5) and12 cores, each of which is 2.30 GHz (Intel XeonE5-2630). Eachslavenode has6cores2.30GHz(IntelXeon E5-2630),32 GBRAMand6 TB diskspace(3×
2 TB disks).Nodes are connectedvia 1 GB Ethernet. Inthe experimentalresults, we refer to query selectivity asthe ratio ofthe number of qualified modeledsegmentsoverthatoftotalmodeledsegments.The datasetcontains discreteaccelerometer datafrommobile phonesandisasequenceoftupleseachofwhichhasone times-tamp and three sensor values representing the coordinates. The size of the raw sensor data set is 22 GB including 200 million data points. We simulate the sensor data emission, in order to segmentandupdatesensordataintotheKVM-indexinanonline manner. Weimplementanonlinesensordatasegmentation com-ponent [5] applying the PCA (piecewise constant approximation)
[1],whichapproximates onesegment witha constantvalue (e.g., themeanvalueofthesegment).Sincehowtosegmentandmodel sensordataisnotthefocusofthispaper,othersensortimeseries segmentation approachescouldalsohavebeen appliedhere. Pro-videdthat the segments arecharacterized by thetime andvalue intervals, ourKVM-indexandrelatedquery-processingtechniques are able to manage them efficiently in the cloud. Finally, there
Fig. 7.Sensor data updating performance.
are around 25 million sensordata segments (nearly 15 GB) up-loaded intothe key-value store. Regarding the segment gridding, wechoose1 secondasthetimegranularitywhichis application-specific.
6.2.Indexupdating
Fig. 7showstheaverageupdatingtimeofeachsegmentandthe average insertion time of each raw sensor data point during the datauploading phase. Both time and value index keep relatively stableupdating efficiency.The updating ofthevalue KVI-indexis
alittleslowerthanthetimeKVI-index.As discussedbefore,since the domain ofthe value vs-tree is smallerthan that ofthe time
vs-tree,thevalueindexperformsmoreSSupdatingoperations (dis-cussed in Section 4.2) than the time index and therefore incurs morenetworkI/Ocost.Therawsensordatainsertionisthefastest but the amount of data to update is much larger than model-view approach.This is because model-view sensor data achieves datacompressionovertherawsensordatatherebydecreasingthe amountofdatatoupload.
6.3. Model-viewsensordatavs.rawsensordata
We create two tables, which take the time-stamp and sensor value as the row-keysrespectively, such that the queryrange or pointcanbeusedaskeystolocatethequalifieddatapoints.Then, the query processor invokes the MapReduce to access the large sizeofdatapointsforgettingqueryresults.
Figs. 8(a),(b)and(c)presentthequeryresponsetimesfortime range, value range and point queries respectively. As shown in
Figs. 8(a)and(b),themodel-viewapproachtakesaround30% less time than the raw sensor data method for both time and value rangequeries.Althoughtherawsensordatabasedmethodsapply MapReducetodirectlyaccessthequalifiedtuplesviatherow-key based range scan, the amount of raw sensor data to process is much larger than that of the model-view approach. In Fig. 8(c), theprocessingtimeoftherawdatabasedmethodis2
×
lessthan thatofthemodel-viewoneintimepointqueries,becausetheraw datamethodcanusethequerytimepointasindexkeytodirectly accesstherelevantdatapoints,whileourKSMrequirestoperform model filteringandgridding. For value point queries,theFig. 9.Query performance comparison of different model-view approaches. (a)–(c): time range queries, (d)–(g): value range queries.
viewapproachhasnearly3
×
lesstimethantherawdatamethod. As,normally,thereisalarge sizeofdatapointswiththequeried value,MapReduce isusedtoaccessthisqualifiedsensordataset. Inthemodel-viewapproach,thepointqueryprocessingonlyuses randomaccessandrangescan togetqualifiedmodeled segments forgriddinglocally,andthusitsavesthetimeonstarting MapRe-ducetoaccessdata.6.4. Comparisonofmodel-viewapproaches
There are four baseline approaches for querying model-view sensordata,namely:
MapReduce (MR). This approach utilizes MapReduce without supportfromanyindex.Italwaysworksonthewholeindex-model table to filter the qualified modeled segments in the mapping phaseandperformthemodelgriddinginthereducephase.
Intervaltree(IT).Weimplementedthetraditionalquery process-ingoperationsoftheintervaltree[23,26]byaddinganothertable tostoretheSSofeachnodesortedbytherightend-pointof inter-vals.Eachindex, time orvalue, hastwoassociated tables.During the intersection or point search on vs-tree, the query processor decides whichtable to accessbased onthe relationbetweenthe queryrange(orpoint)andthenodevalue.Inthisway,thequery processor can stop scanning once it encounters one unqualified modeledsegment,duetothemonotonicityofend-pointsof mod-eledsegments. IT makes useofrandomaccessandrangescan to sequentiallyfilterthequalifiedmodeledsegmentsandmake grid-dinglocally.
MapReduce+KVI(MRK).TheideaofMRKistoleverageKVI-index to avoid having MapReduce to process the whole table. In MRK, MapReduce is designed to work over one continuous sub-index-modeltable includingallthe SSs oftheaccessed nodesinsearch
operations. Forinstance, in Fig. 6(a), fora time (or value) range query
[
6,
10]
,MRK invokesMapReduce to work on thesub-table within the row-keyrange[
3,
α
,
16,
α
]
.The same idea applies forpointqueries. Ascomparedto ourhybridKSMapproach,MRKisalightweightindexing-MapReduceplan,asitprocessesmany ir-relevantSSsofnodesbetween
S
0 andD
.Filter ofkey-value store(FKV). Some key-value stores such as HBaseprovideafilterfunctionalitytosupportpredicate-basedrow selection [20]. The filtertransmits the filtering predicate toeach region server and then all servers scan their local datain paral-lel. Afterwards, they return the qualified tuples. Our filter-based queryprocessingalsoworkson theindex-modeltable, asthe fil-teringpredicatescanbedirectlyappliedtothecolumns.Thequery processorwaitsuntilallregionserversfinishscansandthenit re-trieveseach returnedqualifiedmodeledsegment toconduct grid-dinglocally.
6.4.1. Rangequery
Figs. 9(a), (b) and(c) present the performance of time range queries. As depicted in Fig. 9(a), KSM outperforms MRup to 3
×
for thelow-selective time range queries.As the query selectivity increases, the amountof SSs forscan basedprocessingdecreases andthatforMapReduceapproachestheentiretable.Therefore,the responsetime increasesandapproachesthatofMR.The response timeofMRincreaseslittle.Asincreasingqueryselectivityleadsto ascending gridding workloadin reduce phase,theseresultsshow that the overhead from model gridding is not dominant in MR. TheresponsetimeofMRKismorethanthatofKSM,butlessthan thatofMR.AsMRK utilizestheKVI-indextolocalizeaconsecutive sub-index-modeltable covering allthe SSsof nodesfoundby in-tersectionsearch,itprocessesfewermodeled segmentsthanMR’s fulltablescanning.Yet,ascomparedtoKSM,MRKprocessesmoreredundantmodeledsegments. Moreover,asthe sub-tableinMRK
covers a large range,the processing time of MRK increaseslittle forlow-selectivequeries.
Fig. 9(b)exhibitstheperformanceofITandFKV approaches.As
FKV needs to wait for each region server of HBase to finish the localdatascanning, its totalresponse time isa littlelonger than thatofIT approach.Theybothconsumemuchmoretimethanall
MR,MRK andKSM,astheyapply sequentialaccessingofmodeled segments.
Wealsoanalyze thenumberofmodeledsegmentsaccessedby each approach in Fig. 9(c). These experiments show how differ-entaccessmethodsofmodeled segmentsaffectthe performance.
MRworksonthe entiretable,thus, thenumberofaccessed seg-ments is the same. From the point of view of the application, only qualified modeled segments are returned forgridding, thus
FKV processesnoredundantmodeledsegmentsandconsumesthe leastamount ofmodeled segments. Since IT scansthe SSof one node until encountering an unqualified model, the total number ofaccessedsegments is alittle largerthan that ofFKV.OurKSM
processes larger number ofsegments than both IT and FKV due tothecontinuous andredundantrangeof SSs found by iSearch+. However, the results also verify our theoretical analysis that the amountofredundantmodeledsegmentsisbounded.MRKaccesses moresegmentsthanIT,FKV andKSM,asitaddstheSSsbetween
S
0 andD
to forma continuous sub-tableforMapReduce. Refer-ring to Fig. 9(a) andFig. 9(b),although KSM approach consumes moresegments thanIT andFKV,its hybridparadigm isthemost efficient.Figs. 9(d), (e) and (f) present the value range query perfor-mance. Thedifferent queryprocessingapproaches exhibit similar patterns as for the time range queries, so we skip the detailed analysis.
6.4.2. Pointquery
The time and value point query processing performance are showninFig. 10.IT wins bothfor time andvalue point queries. The response time of KSM is a little greater than IT, but out-performs the other approaches, because IT is able to access all qualified modeled segments in one SS. However, the KSM scans thewholeSSentriesofanodetofindthequalifiedones.Because oftheinvocationofMapReduceandredundantmodeledsegments inthesub-table,MRK takesmoretimethanbothITandKSM.But, asMRK does not work on the entiretable as MR does, it takes about2
×
lesstime thanMR. FKV consumes themost time asit needs to wait for server-sidefull table scan before gridding op-erations. Since the size ofthe domain of the sensor data values is smaller than that of the time domain, nodes in value vs-treeaccommodate more modeled segments than time vs-tree nodes. Thus,theresponsetimesofvaluepointqueriesofIT,FKVandKSM
approachesare allmorethanthoseoftimepoint queries.ForMR
andMRK,theprocessingtime differencesbetweentimeandvalue queriesareinsignificant,asthetime spentonmodelfilteringand griddingisnotdominant.
6.5.InsightsintoKVI–Scan–MapReduce
Section 6.5.1 discusses effect of the searching depth on the performances of in-memory iSearch+, sequential scan based and MapReduce based model filtering and gridding. At last, we will see the workload constitutions within the KVI–Scan–MapReduce paradigminSection6.5.2.
6.5.1. Searchingdepth
InFigs. 11, 12 and 13, we presentthe results fromtime and value range queries of 10% selectivity and 10% to 50% iSearch+
depth.The percentageof iSearch+ depth heremeansthe ratioof
Fig. 10.Query performance of point queries.
Fig. 11.Effect oniSearch+.
Fig. 12.Effect on sequential scan.
Fig. 14.Query processing time constitution of time range queries.
Fig. 15.Query processing time constitution of value range queries.
registration node searching depth over the height of vs-tree in
iSearch+.AsiSearch+depthincreases,thenumberofSSsforrange scan based modelprocessing increases. Therefore, the time con-sumed by iSearch+ andrange scan based model processingboth increases, shown inFigs. 11 and 12. As forthe MapReduce part, thedeeperthelevelofregistrationnode
τ
is,thesmallerthespace coveredbythesub-treerootedatτ
is.ThentherangeofSSsinD
istighterover therangeofSSsofqueryrangeandresultsinless redundantSSs for MapReduce, which is theoretically analyzed in Section5.3.FromFig. 13,wecanseeasalientdecreasingtrendof MapReduceprocessingtime.6.5.2. Workloadconstitution
Thisexperimentaimstorevealhowmuchtime theKVI–Scan– MapReduce spends on model gridding, which is a difference of model-view approach from raw sensor data approach. Figs. 14 and 15showtheresultsfromtimeandvalue rangequeriesof se-lectivityfrom10% to50%.
From Figs. 14 and 15, the time on model gridding accounts for 1
/
3–1/
2 of the total query processing time. As the majority of gridding work is done in the reduce phase and the amount of qualified segment models sent to reducers depends on the query selectivity, the time spent on modelgridding increases as thequeryselectivityincreases.Ifthemodelgriddingcan adaptto users’differentrequirementsforqueryresults,theperformanceof theKVI–Scan–MapReduceschemecanbefurtheroptimized.7. Conclusion
To the best of our knowledge, this is the first work to ex-plorethekey-valuerepresentationofan intervalindexfor model-viewbased sensordatamanagement.Differentfromconventional
external-memoryindexstructurewithcomplexnodemergingand splitmechanisms,ourKVI-index,residentpartiallyinmemoryand partially materialized in the key-value store, is easy to maintain inthedynamicsensordatagenerationenvironment.Moreover,we proposed a hybrid queryprocessing approach, namelyKVI–Scan– MapReduce,integratingtheKVI-index,rangescan andMapReduce formodel-view sensordatain key-valuestores. Extensive experi-ments inarealtestbedshowedthat ourapproachoutperformsin terms of query response time and index updating efficiency not onlyqueryprocessingmethodsbasedonrawsensordata,butalso all otherapproachesconsidered basedonmodel-viewsensordata fortime/valuerangeandpointqueries.Asafuturework,weplan to explore howto processtime andvalue compositequeries and joinqueriesbasedontheKVI-index.
Acknowledgement
This work is supported by the European Commission (EC), ProjectFP7-ICT-2011-7-287305OpenIoT.
References
[1]S. Sathe, T.G. Papaioannou, H. Jeung, K. Aberer, A survey of model-based sen-sor data acquisition and management, in: Managing and Mining Sensen-sor Data, Springer, 2013.
[2]A. Thiagarajan, S. Madden, Querying continuous functions in a database system, in: SIGMOD, 2008.
[3]A. Deshpande, S. Madden, MauveDB: supporting model-based user views in database systems, in: SIGMOD, 2006.
[4]T. Papaioannou, M. Riahi, K. Aberer, Towards online multi-model approximation of time series, in: MDM, 2011.
[5]T. Guo, Z. Yan, K. Aberer, An adaptive approach for online segmentation of multi-dimensional mobile data, in: Proc. of MobiDE, SIGMOD Workshop, 2012. [6]H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, E. Keogh, Querying and mining
of time series data: experimental comparison of representations and distance measures, VLDB Endow. 1 (2008) 1542–1552.
[7]E. Keogh, S. Chu, D. Hart, M. Pazzani, An online algorithm for segmenting time series, in: Proceedings of the IEEE International Conference on Data Mining, ICDM 2001, IEEE, 2001, pp. 289–296.
[8]Y. Cai, R. Ng, Indexing spatio-temporal trajectories with Chebyshev polyno-mials, in: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, ACM, 2004, pp. 599–610.
[9]Q. Chen, L. Chen, X. Lian, Y. Liu, J.X. Yu, Indexable PLA for efficient similarity search, in: Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB Endowment, 2007, pp. 435–446.
[10]Saket Sathe, Thanasis G. Papaioannou, Hoyoung Jeung, Karl Aberer, A survey of model-based sensor data acquisition and management, in: C.C. Aggarwal (Ed.), Managing and Mining Sensor Data, Springer US, 2013.
[11]A. Bhattacharya, A. Meka, A. Singh, Mist: distributed indexing and querying in sensor networks using statistical models, in: VLDB, 2007.
[12]A. Deshpande, C. Guestrin, S.R. Madden, J.M. Hellerstein, W. Hong, Model-driven data acquisition in sensor networks, in: Proceedings of the Thirtieth In-ternational Conference on Very Large Data Bases, in: VLDB Endowment, vol. 30, 2004, pp. 588–599.
[13]E. Keogh, K. Chakrabarti, M. Pazzani, S. Mehrotra, Locally adaptive dimension-ality reduction for indexing large time series databases, SIGMOD Rec. 30 (2) (2001) 151–162.
[14]J. Shieh, E. Keogh, iSAX: indexing and mining terabyte sized time series, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 623–631.
[15] OpenTSDB,http://opentsdb.net/,2011.
[16]T. Guo, T.G. Papaioannou, K. Aberer, Model-view sensor data management in the cloud, in: 2013 IEEE International Conference on Big Data, IEEE, 2013, pp. 282–290.
[17]T. Guo, T.G. Papaioannou, H. Zhuang, K. Aberer, Online indexing and distributed querying model-view sensor data in the cloud, in: The 19th International Con-ference on Database Systems for Advanced Applications, 2014.
[18]F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chan-dra, A. Fikes, R.E. Gruber, Bigtable: a distributed storage system for structured data, ACM TOCS 26 (2008) 4.
[19]J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (2008) 107–113.
[20]L. George, HBase: The Definitive Guide, O’Reilly Media, Inc., 2011.
[21]C.P. Kolovson, M. Stonebraker, Segment indexes: dynamic indexing techniques for multi-dimensional interval data, SIGMOD Rec. 20 (2) (1991).
[22]G. Sfakianakis, I. Patlakas, N. Ntarmos, P. Triantafillou, Interval indexing and querying on key-value cloud stores, in: ICDE, 2013.
[23]L. Arge, J. Vitter, Optimal dynamic interval management in external memory, in: 37th Annual Symposium on Foundations of Computer Science, 1996. [24]R. Elmasri, G.T.J. Wuu, Y.-J. Kim, The time index: an access structure for
tem-poral data, in: VLDB, 1990.
[25]T. Bozkaya, Z.M. Özsoyoglu, Indexing valid time intervals, in: DEXA, 1998. [26]H.-P. Kriegel, M. Pötke, T. Seidl, Managing intervals efficiently in
object-relational databases, in: VLDB, 2000.
[27]H.-P. Kriegel, M. Pötke, T. Seidl, Object-relational indexing for general interval relationships, in: Advances in Spatial and Temporal Databases, Springer, 2001. [28]S. Nishimura, S. Das, D. Agrawal, A.E. Abbadi, MD-HBase: a scalable
multi-dimensional data infrastructure for location aware services, in: MDM, 2011.
[29]J. Wang, S. Wu, H. Gao, J. Li, B.C. Ooi, Indexing multi-dimensional data in a cloud system, in: SIGMOD, 2010.
[30]M.-Y. Iu, W. Zwaenepoel, HadoopToSQL: a MapReduce query optimizer, in: Eu-roSys, 2010.
[31]J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, J. Schad, Hadoop++: making a yellow elephant run like a cheetah (without it even noticing), VLDB Endow. 3 (2010) 515–529.
[32]J. Dittrich, J.-A. Quiané-Ruiz, S. Richter, S. Schuh, A. Jindal, J. Schad, Only ag-gressive elephants are fast elephants, VLDB Endow. 5 (2012) 1591–1602. [33]M.Y. Eltabakh, F. Özcan, Y. Sismanis, P.J. Haas, H. Pirahesh, J. Vondrak,
Eagle-eyed elephant: split-oriented indexing in Hadoop, in: EDBT, 2013.
[34]H.-C. Yang, D.S. Parker, Traverse: simplified indexing on large map-reduce-merge clusters, in: DASFAA, 2009.