Efficient Indexing and Query Processing of Model-View Sensor Data in the Cloud

(1)

Contents lists available atScienceDirect

Big

Data

Research

www.elsevier.com/locate/bdr

Eﬃcient

Indexing

and

Query

Processing

of

Model-View

Sensor

Data

in

the

Cloud

✩

Tian Guo

a

,

∗

,

Thanasis

G. Papaioannou

b

,

Karl Aberer

a

a_{EPFL, Switzerland}

b_{Center for Research &}_{Technology Hellas}_(CERTH),_Greece

a

r

t

i

c

l

e

i

n

f

o

a

b

s

t

r

a

c

t

Article history:

Availableonline11July2014 Keywords: Bigdata Index Key-valuestores MapReduce Approximation Queryoptimization

Asthenumberofsensorsthatpervadeourlives increases(e.g.,environmentalsensors,phonesensors,

etc.),the eﬃcientmanagement ofmassiveamountofsensordata isbecomingincreasinglyimportant.

The inﬁnite nature of sensor data poses a serious challenge for query processing even in a cloud

infrastructure. Traditional raw sensor data management systems based on relational databases lack

scalability toaccommodatelarge-scale sensordata eﬃciently.Thus,distributedkey-valuestores inthe

cloud arebecomingaprimetool tomanagesensordata.Model-viewsensordata management,which

stores the sensor data in the form of modeled segments, brings the additional advantages of data

compression and value interpolation. However,currently thereare notechniquesfor indexing and/or

query optimizationof the model-view sensor data in the cloud; full table scan is neededfor query

processing in the worst case. In thispaper, we propose an innovative index for modeled segments

in key-value stores, namely KVI-index. KVI-index consists of two interval indices on the time and

sensorvaluedimensionsrespectively,eachofwhichhasanin-memorysearchtreeandasecondarylist

materialized inthekey-valuestore.Then,weintroduceaKVI-index–Scan–MapReducehybridapproach

toperformeﬃcientqueryprocessinguponmodeleddatastreams.Asprovedbyaseriesofexperiments

ataprivatecloudinfrastructure,ourapproachoutperformsinquery-responsetimeandindex-updating

eﬃciencybothHadoop-basedparallelprocessingoftherawsensordataandmultiplealternativeindexing

approachesofmodel-viewdata.

1. Introduction

Recent advances in sensor technology have enabled the vast deploymentofsensorsembeddedinuserdevicesthatmonitor var-iousphenomenafordifferentapplicationsofinterest,e.g., air/elec-trosmogpollution,radiation,earlyearthquakedetection,soil mois-ture, permafrost melting, etc. The data streams generated by a large numberof sensors are representedastime series inwhich each data point is associated with a time-stamp and a sensor value.Theserawdiscreteobservationsaretakenastheﬁrstcitizen in traditional relationalsensor data management systems, which leadstoanumberofproblems.Onone hand,inordertoperform analysisof the raw sensordata, users usually adopt other third-partymodelingtools(e.g.,Matlab,R andMathematica),which in-volveof tediousandtime-consumingdata extract,transform and

✩ _This_article_belongs_to_Scalable_Computing_for_Big_Data.

*

Correspondingauthor.

E-mail addresses:tian.guo@epﬂ.ch(T. Guo),[email protected] (T.G. Papaioannou),karl.aberer@epﬂ.ch(K. Aberer).

load (ETL) processes [1]. Moreover, such data analysis tools are usually used for static data set and therefore cannot be applied for onlineprocessingof sensordata streams. Onthe other hand, unboundedsensordatastreamsoftenhavemissingvaluesand un-knownerrors,whichalsoposesgreatchallengesfortraditionalraw sensordatamanagement.

To this end, various model-based sensor data management techniques [1–4] have been proposed. Model-view sensor data management leverage time series approximation and modeling techniquestosegmentthecontinuoussensortime seriesinto dis-joint intervalssuch that each interval can be approximatedby a kind ofmodel, such aspolynomial, linearor SVD. These models, foralltheintervals(orsegmentmodels),exploittheinherent cor-relations (e.g.withtimeoramongdatastreams)inthesegments ofsensortime seriesandapproximateeachsegment byacertain mathematical function within a certain error bound [5–9].Then, one can only materialize the models of the segments instead of therawdataandharvestanumberofbeneﬁt:

First, model-view sensor data achieves compression over raw sensor dataandthereforerequireslessstorage overhead[10–12]. Second, due to the sampling frequency or sensor malfunction,

(2)

there may be some missing values at some time points. If one queryinvolvessuchtimepoints,thentherelevantsegmentmodel can be used to estimate the values [10–12]. In some degree, model-view sensor data increases the data availability for query processing. Third, there usually exist outliers inraw sensordata, whichhas negative effect on the query results.Model-view sen-sor data removes the outliers in each interval via the segment modelandhistoricaldataonupperandlowerdatabounds,thereby diminishingtheeffectofoutliersinqueryresults.Fourth, regard-ing the similarity search or pattern discovery in sensor time-seriesmining, the segment-based time seriesrepresentation is a powerfultool fordimension reduction andsearch spacepruning

[9,13,14].

However, proposed model-basedsensordata management ap-proaches mostly employ the relational data model and process queriesbased onmaterialized views[2,3] ontop ofthe modeled segmentsofsensordata.Nowadays,theamountofdataproduced bysensorsisexponentiallyincreasing.Moreover,thereal-time pro-ductionofsensordatarequiresthedatamanagementsystemtobe abletohandlehigh-concurrentmodel-viewsensordatafrom mas-sivesensorsandthisisdiﬃcultfortraditionalrelationaldatabase torealize.Tothisend,recentprevalentcloudstoreandcomputing techniquesprovideapromisingwaytomanagemodel-viewsensor data[15–19].

The main focus of this paper is on how to manage the seg-mentmodelsofsensordata,namelymodel-viewsensordatawith thenewlyemergingcloudstoresandcomputingtechniquesrather thanhowtoexploremoreadvancedsensordatasegmentation al-gorithms.

Inour approach, we exploit key-value stores andthe MapRe-duceparallelcomputingparadigm [18,20],twosigniﬁcant aspects ofcloudcomputing, torealizeindexingandqueryingmodel-view sensordata inthe cloud.We characterize the modeled segments ofsensortime seriesbythe time andthevalue intervalsofeach segment[16,17].Consequently,inordertoprocessrangeorpoint queries onmodel-view sensor data,our indexin the cloud store shouldexcelinprocessingintervaldata.Currentkey-valuebuilt-in indicesdonotsupportinterval-relatedoperations.Theinterval in-dex formodel-view sensor data should not only work forstatic data, but it should be dynamically updated based on the new arrivingsegments ofsensor data.Iftraditional batch-updatingor periodicalre-buildingstrategy wasapplied here[21,22],then the highspeed of sensor data yielding might lead to a large size of thenewunindexeddata,eveninshorttimeperiodsandto signif-icant indexupdating delay as well. As a result, theperformance ofqueries involvingboth indexedandunindexed datawould de-generategreatly. Therefore, the interval index in the cloud store shouldbeabletoinsertanindividualnewmodeledsegmentinan onlinemanner.

Thecontributionsofthispaperaresummarizedasfollows:

•

Innovative interval index: We propose an innovative interval index for model-view sensor data management in key-value stores, referred toas KVI-index. KVI-indexis a two-tier struc-tureconsistingofonelightweightandmemory-residentbinary searchtreeandoneindex-modeltablematerializedinthe key-value store. This composite index structure can dynamically accommodatenewsensordatasegmentsveryeﬃciently.

•

Hybrid model-view query processing: After exploring the searchoperationsinthein-memorystructureoftheKVI-index forrangeandpointqueries,weintroduceahybridquery pro-cessingapproachthatintegratesrangescanandMapReduceto processthesegmentsinparallel.

•

Intersection search: We introduce an enhanced intersection search algorithm (iSearch+) that produces consecutiveresults suitable for MapReduce processing. We theoretically analyze

theeﬃciencyof(iSearch+)andﬁndtheboundon the redun-dantindexnodesthatitreturns.

•

Experimentalevaluation:Ourframeworkhasbeenfully imple-mented,includingtheonline sensordatasegmentation, mod-eling, KVI-index andthe hybrid queryprocessing, andit has beenthoroughlyevaluatedagainstasignificantnumberof al-ternative approaches. Asexperimentally shownbasedon real sensor data,our approach significantly outperforms in terms ofqueryresponsetimeandindexupdatingefficiencyallother onesforansweringtime/valuepointandrangequeries. Theremainder ofthispaperisorganizedasfollows:Section 2

summarizessome relatedwork on model-viewsensor data man-agement,intervalindexandindex-based MapReduceoptimization approaches.InSection3,we provideabriefdescriptionof sensor-datasegmentation, queryingmodel-view sensordataandthe ne-cessitytodevelop intervalindexformanaging model-viewsensor data in key-value stores. The detailed designs of our innovative KVI-indexandthehybridqueryprocessingapproacharediscussed in Sections 4 and 5 respectively. Then, in Section 6, we present thoroughexperimentalresultstoevaluateourapproachwith tradi-tionalqueryprocessingonesonbothrawsensordataandmodeled datasegments.Finally,inSection7,weconcludeourwork.

2. Relatedwork

Time series segmentation is an important research problem in theareas ofdata approximation, dataindexing anddata min-ing. A lot of work has been devoted to exploit different types ofmodels to approximatethe segments oftime series,such that the pruning and reﬁnement framework can be applied to this segment-representedtimeseriesforthepatternmatchingor sim-ilarity search [9,13].Some other researchers proposed techniques for managing the segment models that approximatesensor time series in relational databases. MauveDB designed a model-based view toabstractunderlyingrawsensordata;itthenusedmodels to project the rawsensor readings onto grids for query process-ing [3]. As opposed to MauveDB, FunctionDB only materializes segment models of raw sensor data [2]. Symbolic operators are designed to process queries using models rather than raw sen-sor data. However, both approaches in [3] and [2] do not take intoaccountapplyingindicestoimprovequeryprocessing. More-over, their proposedapproach focusesonmanaging staticdataset oftimeseriesratherthandynamictimeseries.

Also,eachsegmentoftime seriescouldbecharacterizedbyits timeandvalueintervals.Then, oneshouldconsideremployingan intervalindexforprocessingqueriesovermodel-viewsensordata. Twocommonusedindicesforintervaldataaresegmenttree[21]

andintervaltree[23].Asforsegmenttree,itisessentiallyastatic datastructurethat needsaninitializationprocess toconstruct el-ementary intervalsbased onthe initial dataset [21]. Oncea new intervaloutsideofthedomainofcurrentsegmenttreearrives,the elementary intervals should be rebuilt, which is not suitable for therealtimenatureofsensordatastreams[21].Regardingthe in-tervaltree,individualintervalisnotdividedandreplicatedduring theinsertionphaseasinthesegmenttreeandthereforethe stor-ageoverheadislineartothenumberofintervalstoindex[23–25]. However,itisamemory-orientedstructure.

Some efforts [22,23,26,27] have also been done to external-izethesein-memoryindexdatastructures.The relationalinterval tree(RI-tree)[26]integratesintervaltreeintorelationaltablesand transformsintervalqueriesintoSQLqueriesontables.Thismethod makeseﬃcientuseofbuilt-inB+-treeindicesofRDBMS. Neverthe-less,inthispaper,weaimtodesignanintervalindexstructurefor model-view sensor datathat is compatiblewith key-value stores anddistributedqueryprocessinginthecloud.

(3)

In [28], they proposed an approach that enables key-value storesto support queryprocessing ofmulti-dimensional datavia integratingspaceﬁlingorderintorow-keys,whichisnotﬁtfor in-tervaldata.Anindexformulti-dimensionalpointdataisproposed in[29]basedonP2Poverlaynetwork,whichisadifferent underly-ingarchitecturefromouradoptedkey-valuestore.Thelatesteffort todevelopintervalindicesinthecloudutilizesMapReduceto con-structasegmenttreematerializedinthekey-valuestore[22].This approach outperforms the interval query processing provided by HBase(http://hbase.apache.org/)andHive(http://hive.apache.org/). However, the segment tree utilized in [22] is essentially a static index, as is also the case with [21]. Therefore, an index re-buildingphase needs to be periodically executed to include new data.

MapReduce parallel computing is an effective tool to process largescaleofsensordataincloudstores,butconventional MapRe-duce always conducts trivial full scan on the whole dataset for thequeries ofanyselectivity[19].Inorder toenable MapReduce andindicestocollaborate forqueryprocessing,manyresearchers proposedtointegrateindextechniquesintoMapReduceframework toavoidfulldatascanforlow-selectivequeries[30–34].Oneprior workistoconstructaB+-tree overstaticdatasetinadistributed filesystem(e.g.HDFS)andmaterializetheindexalsoinHDFS[34]. Then, the query processor can process thisindex file, which in-volvesofmultipleMapReducejobstoaccessdifferentlayersofthe treeandthereforeisnotefficientenough[34].Moreover,the pro-posed HDFSbased B+-tree isonly effective for point datarather thanintervaldata[34].

Theauthorsin[31] integrate indicesintoeach filesplitofthe datafile inHDFS, suchthat mappers canuse theindices toonly access predicate-qualified data records in each split. In [32], in-dicesapplicable to querieswith predicatesonmultiple attributes were developed by indexing different record attributesin differ-ent block replicasof data files. These two kinds of index access methods in MapReduce are both on record-level. If we integrate an interval index into each file split guided by this direction, the MapReduce data preparation, starting overhead and Mapper waves are actually not decreased much and therefore the per-formance doesnot significantly improve. On theother hand,the update in one file split requires to rebuiltthe whole local index ofthesplitthereby havinghighlatencyforsegmentinsertions.In

[33],asplit-levelindexisdesignedtodecreaseMapReduce frame-workoverhead.Ascomparedto aboverecord-levelindicesin[31, 32],split-levelindiceseliminateirrelevantﬁlesplitsbefore launch-ingmappers, andthedata transferringandstarting upoverheads are furtherdecreased.This kindofapproach ismoreeﬃcient for decreasing the map function invocations via overriding the data reader of MapReduce [31,32], as compared to record-level opti-mizations.

3. Overviewofmodel-viewsensordatamanagement

Inthissection,wediscusscertainimportantissuesfor manag-ingmodel-viewsensordataina key-valuestore.First,weexplain whatismodel-viewsensordata.Then,wediscusspossiblestorage schemasformodel-viewsensordata andexplain thenecessityto developaspecialindexforitinkey-valuestores.Last,wedescribe the querytypes of our focusand some particular techniques for processingmodel-viewsensordataqueries.

3.1. Sensortimeseriessegmentation

Sensor time series segmentation is a type of algorithms that fragment a time series stream into disjoint segments, and then approximate each data segment by a mathematical function or model,such that a speciﬁc errorbound is satisﬁed[5–9]. Query

Algorithm 1:Sensortimeseriessegmentation.

Input:vt,t,/*valueandassociatedtimepointofonesensorreading*/

error_bound 1 begin

2 ifcurrentSeg_error(anchor, t) <error_boundthen

3 seg[anchor,t]=segmentModel(vanchor,. . . ,vt);

4 /*approximatethesensorvaluesbetweenthetimerange[anchor,t] withthespeciﬁedtypeofmodel(e.g.,constant,linear,polynomial, etc.)*/

5 else

6 lt=anchor;

7 lr=t−1;

8 SegMaterialization(seg_[anchor,t])

9 /*outputthisnewproducedsegment

seg

[anchor,t]formaterialization */

10 anchor=t;

Table 1 Symbollist.

Symbol Semantics

lt Beginning timestamp of one segment model

rt Ending timestamp of one segment model

lv Minimum value of one segment model’s value range

rv Maximum value of one segment model’s value range

pi0. . .pin Coeﬃcient of each item in polynomial model

processing can then be performedon the materialized segments insteadoftherawsensordata,asin[2].Sensordatasegmentation andmodeling algorithmshavebeenextensivelystudiedin[4–9].

Assegmentationalgorithmsarenotthefocusofthispaper,we only presentageneral frameworkin Algorithm 1foronline time series segmentation[7],which isusedin thedatasetpreparation phase ofoursub-sequential experimentalevaluationinSection6.

Table 1summarizesthelistofsymbolsthat weemployformodel segments.

Thissensor timeseriessegmentation programis invokedeach time a new sensorreading comes. It ﬁrst checks if thesegment modeloverthesensorreadingsfromtheanchortimepointto cur-rent time, namely in range

[

anchor

,

t

]

, satisﬁes the error bound. If yes, the current segment model is updated to seg_[anchor,t]. If no, the current time t is supposed to be the starting point an-chorofanewsegment.Then,itoutputs thesegmentseg_[anchor,t−1] for thematerialization process forwhich our designedindex ap-proach is responsible. The segment seg_[anchor,t−1] consists of the following information: time interval

[

lt

,

rt

]

, value range

[

lv

,

rv

]

and model formula (for instance, it could be a sequence of co-eﬃcients for each item of one polynomial function). Under this framework,userscanchoosedifferentcategoriesofmodelsto ap-proximatethesensordatainsegments.Forinstance,atimeseries shownin Fig. 1(a) istransformed intoa sequenceof linear func-tions with time dimension astheindependent variable,which is depictedinFig. 1(b).The associatedvaluerangeinFig. 1(b) indi-cates the values one segment model can cover within the time interval. Such set of functions, each with its associated

[

lt

,

rt

]

and

[

lv

,

rv

]

for the sensor data segments are referred to as the model-viewsensordataovertheoriginalrawsensordata observa-tions.

3.2. Storagemodelinkey-valuestores 3.2.1. HBase overview

A table in HBase consists of a set of regions, each of which storesasequentialrangepartitionoftherow-keyspaceofthe ta-ble [18,20]. An HBase cluster consists of a collection of servers, calledregionservers,whichareresponsibleforservingasubsetof regions ofone table.The key-baseddataaccessmethodofHBase isrealizedbyathreelayeredB+treemaintainedamongthenodes

(4)

Fig. 1.Model-view sensor data.

Fig. 2.Model-view sensor data in HBase key value store.

ofHBasecluster.ThetwoupperlevelsreferredtoastheROOTand METAregionsarekeptinthemasternodeofan HBasecluster.The ROOTregionplays theroleofrootnodeofoneB+tree,whilethe METAtablemaintainsamappingoftheregionsofatabletoregion servers.Theregionsdistributedamongtheothernodesconstitute thelowestlevel oftheB+-tree indexandstorethereal dataofa table.Ifaregion’ssizeexceedsaconﬁgurablelimit,HBasecan dy-namicallysplit itintotwo sub-regions,whichenablesthesystem togrowdynamicallyasdataisappended.RegardingaMapReduce job ona table inHBase, thenumber ofmappers isequal by de-faulttothat ofregions ofthetable toprocess,whichmeans that eachmapperprocesses oneregion, asshowninFig. 2.The num-berofreducersisconﬁgurableprogrammatically.AndHBaseallows MapReduce to only process key-based table partitions of inter-est,which can avoid trivialfull-scan ona table forlow-selective queries[18,20].

3.2.2. Interval-indexinHBase

Forthemodel-viewsensordatainthekey-valuestore,thetime interval,the value range andthemodel formula(e.g. polynomial coeﬃcients) can fully approximate one data segment. There are differentpossiblewaystoorganizerowkeysandcolumnsfor stor-ingandqueryingmodel-viewsensordata.

Oneidea forstoring segments inkey-value storescould be to do it similarly to that of the raw sensor data management sys-temsuchasOpen-TSDB[15],whichtakesthetimeintervalofone segmentasthe rowkey(rk). As rowsare sortedonthe rowkey in the key-value store, the starting points of time intervals are in ascending order. Therefore, although for time range or point queriesthequery processorknowswhento stopthe scan,it still needstostartthescanfromthebeginningofthetableeachtime. The same problem happens to the table with value intervals as row-keys.Forinstance,inFig. 2,whenthevaluerangeisthe row-key, the table is sorted according to the left boundaries of the value rangesof model-viewsensor data.Fora value queryrange

[

9

.

2

,

10

]

,we canmake sureafterthefourthsegment, namely S2 withvalue range

[

9

.

9

,

10

.

9

]

, thereis noqualiﬁed segmentinthe table. However, still scanning has to start from thebeginning of thetable,sincetherightboundariesofthevaluerangesarenotin order.

Insummary,simplyincorporatingtime,valueintervalormodel coefficientsintotherow-keycannotcontribute toaccelerating the query processing. A generic indexin key-value stores specialized for model-view sensor data is imperative. Moreover, considering thatmodel-viewsensordataisproducedinreal-time,theindexfor model-viewsensordatainkey-valuestoresshouldbeableto up-dateontheflyefficiently[20].Sinceeachsegmentofmodel-view sensordataischaracterizedby itstimeandvalueranges,we will concentrate on employing interval indexto index both the time andvaluedimensionofonesegment.

However, another issue for designing the interval index for model-view sensor data in key-value stores is where to materi-alizetheindex. One possiblewaywouldbeto storethe indexin separated files in the distributed file system (e.g. HDFS, etc.)on top of which the key-value store HBase isalso built. In this ap-proach, as in the work [34], each layer of the tree structure is storedin one file.Consequently, indexsearchingneeds to invoke multiple MapReduce jobs to iteratively process each level of the tree,whichisnotefficient.Anotherwayistointegrate oneindex for each region to only index the local model-view sensor data in that region, such that mappers can first loadthe indexto lo-catedesireddatainthat regionandthenaccessit.However, this kindofapproachdoesnotdecreasedata preparationandstarting overheadofMapReduceforthewhole table,sincestill all the re-gions of the table are initialized to be processed by MapReduce

[31,32].Ourproposedkey-valuebasedintervalindex(describedin thenext section)steps outofaboveideas andtakesadvantage of boththein-memoryandkey-valuepartstoimprovethequery pro-cessing.

(5)

Fig. 3.Two-tier structure of KVI-index.

3.3. Queryingmodel-viewsensordata

Muchworkhasbeendevotedtopatterndiscoveryfortime se-riesdata,butinthispaper, wefocuson thefollowingfour types offundamentalqueriesonmodel-viewsensordata.

•

Time pointquery:returnthevalueofonesensorataspeciﬁc timepoint.

•

Value pointquery: returnthetimestamps whenthe value of onesensorisequaltothequeryvalue.Theremaybemultiple timestampsofwhichsensorvaluessatisfythequeryvalue.

•

Time rangequery:returnthevaluesofonesensorduringthe querytimerange.

•

Value range query: return the time intervals of which the sensor values are within the query value range. There may be multipletime intervalsof whichsensor valuessatisfy the queryvaluerange.

The generic process to querymodel-view sensordata queries comprisesthefollowingtwosteps:

•

Searching of qualified segments: The qualified segments are thesegments whosetime (resp.value)intervalsintersectthe querytime(resp.value)rangeorpoint.Thisstepshouldmake use of an interval index to locate all the qualified modeled segmentsinthesegmentmodelstore.

•

Modelgridding:Qualifiedsegmentsare tooabstractanda fi-nite set of data points are more useful as query results for end-users[2].Therefore,modelgridding isanothernecessary process[2,3].Modelgriddingappliesthreeoperationstoeach qualified segment: (i) It discretizes the time interval of the segment at a specific granularity to generate a set of time points.(ii) Itgeneratesthesensorvaluesatthe discretetime points based on the model that approximates the segment. (iii) Itfiltersoutthesensordatathatdoesnotsatisfythequery predicates.Thequalifiedtimeorvaluepointsthatresultfrom griddingallqualifiedmodeledsegmentsarereturnedasquery results.

AsshowninFig. 1(c),segmentone S1 isfoundasonequaliﬁed segmentfortimequeryrange

[

10

,

70

]

,model-griddingdiscretizes thetimeintervalofS1 andevaluatesthevaluesateachtimepoint. Then,thequaliﬁedtime-valuepairsconstitutethequeryresults.

4. Key-valueintervalindex

Wewillﬁrstpresentthedesignofthetwo-tiermodelindexon key-valuestoresandthenwe willdiscussthe updatingalgorithm ofthemodelindex.

4.1. StructureofKVI-index

Aseachsegmentofmodel-viewsensordatahasaspeciﬁctime and value range, instead of indexing the mathematical functions of segments, our idea isto take the time andvalue intervalsas keys to index each segment, which allows the index to directly servethequeriesproposedinSection3.Weproposethekey-value represented interval index (KVI-index) to index time and value intervals of model-view sensor data. For a segment in Fig. 3(c), the timeandvalue intervalsarerespectivelyindexed bythe KVI-index, asdepictedinFigs. 3(a)and(b).OurKVI-indexadoptsthe idea from interval tree,since in-memoryinterval tree’s primary– secondary structure is convenient for externalization to the key-value store [26]. Furthermore, the searching and scanning algo-rithmoftheintervaltreeissuitablefortheMapReducecomputing paradigm.

Our KVI-index is a novel in-memory and key-value compos-ite index structure. The virtual searching tree (vs-tree) residesin memory,whileanindex-modeltable inthekey-valuestore is de-vised tomaterialize thesecondary structure (SS) ofeach node in

vs-tree.

4.1.1. In-memorystructure

Thein-memoryvirtualsearchingtree(vs-tree)isastandard bi-narysearchtreeshowninFig. 3(a).Eachtime(orvalue)intervalis registeredononlyonenodeofvs-tree:theonewithwhichthe in-terval ﬁrstoverlapsalongthesearchingpathfromroot. Thisnode is deﬁnedasthe registrationnode

τ

forthisinterval. Eachnode ofvs-tree hasanassociatedsecondary structure(SS), materialized inthekey-valuestore,whichstoresthesubstantialinformationof themodeledsegmentsregisteredatthisnode.

Weapplyspace-partitionstrategyforvs-tree.Theheightofthe

vs-tree is denoted by h. We set the value of leftmost leaf node as 0. For negative sensor data values, we use simple shifting to have the data range start from0 for convenience. Then, the do-mainofthevs-tree is

[

0

,

R

]

and R

=

2(h+1)

−

_2._The_value _of_root

node is r

=

2h

−

1. During the whole life of KVI-index, only the root value r is kept, since, due to the lightweight computability of the space-partition, the value of each node in the searching path from the root to the node that has the queried point or interval can be calculated in run time. All the operations on vs-tree areperformedinmemoryandare thusvery eﬃcient.As the domains oftime andvalue ofthe sensor data are different, two

vs-trees one for times andanother forvalues are kept in mem-ory simultaneously foranswering time andvalue queries respec-tively.

4.1.2. Index-modeltable

Ournovelindex-model compositestorageschemaenables one table notonlytostorethemodeledsegments,butalsoto

(6)

materi-Algorithm 2:Time(orvalue)KVI-indexupdating.

Input:[lv,rv],[lt,rt],/*valueandtimeintervalsofonemodeledsegment

Mi,

r

/*

M

idenotesthecoeﬃcientsofthemodeledsegment

1 begin

2 /*dynamicdomainexpansion

3 if (rt>R)then

4 r=2log(rt+2)−1₋_{1 /*}_expand_to_new_root_value

5 /*registrationnodesearch

6 node=2logrt+1−_1;

7 h =log(node)−1;

8 while(h ≥0)do

9 if(lt≤node and rt≥node)then

10 break;/*nodeistheregistrationnode

11 else 12 if(lt>node)then 13 node=node+2h_; 14 if(rt<node)then 15 node=node−2h_; 16 h =h −1;

17 /*materializedintotheindex-modeltable.

18 ifthe SS of node has been initializedthen

19 rowkey= node|lt|rt;

20 else

21 rowkey= node|α;

22 insert[lv,rv],[lt,rt],

M

iintothetable.

alizethestructuralinformationofthevs-tree,i.e.,theSSsforeach treenode.

The index-model table is shownin Fig. 3(b). Eachrow corre-spondstoonlyonemodeledsegmentofsensordata,e.g.,thedata segment shownin the blackdotted rectangle in Fig. 3(c). A row key consists of the node value and the interval of an indexed modeled segment at that node. The time interval, value interval andthe model coeﬃcientsare all storedin different columnsof the same row. And the SSs of each node correspond to a con-secutive range of tuples in the index-model table. For instance, the rows corresponding to the SSs of node 1, node 5 and node 13 invs-treeareillustrated inFig. 3(b).Analogously,wehavetwo index-modeltablesthatcorrespondtotime andvalue vs-trees re-spectively.

4.2.KVI-indexupdates

The complete modeled segment updating algorithm of KVI-indexisshowninAlgorithm 2.Itincludestwoprocesses:

(1)Registration nodesearching: locate the node

τ

at which one time(value) interval

[

lt

,

rt

]

shouldberegistered.

(2)Materialization of modeled segments: construct the row-key based on the

τ

and materialize one modeled segment’s in-formationintothecolumnsofthecorrespondingrow.

4.2.1. Registrationnodesearching(rSearch)

Thisalgorithmﬁrst involvesa domain-expansionprocess to dy-namicallyadjustthedomainofthevs-treeaccordingtothespeciﬁc domain of the sensor data. Then, the registration node can be foundonthevalidatedvs-tree.

Lemma 1. For a modeled segment Mi with time (value) interval

[

lt

,

rt

]

(resp.

[

lv

,

rv

]

), its registration node lies in a tree rooted at 2log(rt+2)−1

−

₁_.

Proof.The domain of one tree rooted at 2log(rt+2)−1

−

_{1 is}

[

0

,

2log(rt+2)

−

₂

]

_._As ₂log(rt+2)

−

₂

≥

₂log(rt+2)

−

₂

=

_r

t,the reg-istration node of interval

[

lt

,

rt

]

must be in a tree rooted at 2log(rt+2)−1

−

_1.

2

Fig. 4.Registration node searching of KVI-index.

Lemma2.ForamodeledsegmentMiwithtime(value)interval

(

lt

,

rt

)

,

iftherightend-pointrtsatisﬁesrt

>

R,thedomainofvs-treeneedsto expand.

Proof. The current vs-tree’s height is h and lt

≤

r, the interval

[

lt

,

rt

]

rides over the root node r. Assume that we do not ex-pand the domain and hang

[

lt

,

rt

]

on r. When a new model Mj with a time (or value) interval

[

lt

,

rt

]

(or

[

lv

,

rv

]

) and lt

>

R comes, the root value has to increase to r

=

2logrt+2−1

−

₁

(resp. r

= · · ·

) as

[

l_t

,

r_t

]

(resp.

[

l_v

,

r_v

]

) intersects no node of cur-rentvs-tree.Then betweenr andr,there isonenodewithvalue 2(h+1)

−

_1. _The _interval

(

_l

t

,

rt

)

is stored at node r

=

2h

−

1 and rt

>

2(h+1)

−

2

⇒

rt

≥

2(h+1)

−

1, therefore registering the in-terval

[

lt

,

rt

]

on node r contradicts withthe interval registration rule.

2

UsingLemmas 1 and2,KVI-indexisabletodynamicallydecide whenandhowtoadjustthedomain

[

0

,

R

]

ofvs-tree.Thecomplete

rSearchcanbeillustratedbyFig. 4.Whenmodel1istobeinserted, the vs-tree rooted at node7 is still valid. model1 is registered at

node5.However,whenmodel2arrives,its rightend-point,i.e., 16, exceeds thedomain

[

0

,

14

]

.The vs-treeisexpandedhaving 15 as newroot andthe newextended domain isthe area enclosed by the dotted block in Fig. 4. Then, the sub-sequential model2 and

model3canbeupdatedsuccessfully.

The rSearch process can be further optimized via adaptively adjusting the starting point based on the interval to index. In this way, rSearch does not need to always search from root node, thereby shortening the length of searching path.Based on

Lemma 1, forone interval

(

lt

,

rt

)

,we can startto search for reg-istrationnodefromnoder0

=

2logtr+1

−

1 ratherthanrootnode r,whichcanbereferredtoLine7 inAlgorithm 2.Also,thenodes outsideofsub-domain

[

0

,

2logtr+1+1

−

₂

]

_of_vs-tree _will _not be-come the registrationnode, becausethe interval

(

tl

,

tr

)

doesnot intersectwiththem.Forexample,themodel4inFig. 4.The adap-tivesearchingﬁndsthat thesubtreerootedatnode 7 isthe min-imumone covering model4’sinterval

[

9

,

14

]

.Therefore,node 7 is thestartingpointforregistrationnodesearchingandthelengthof searchingpathisonly1.

4.2.2. Materializationofmodeledsegment

WhenmaterializingmodelMiintotheSSofanode

τ

,the row-keymaybechosenintwoways:

•

UponinitializationoftheSSofnode

τ

:whennomodeled seg-menthasbeenstoredat

τ

’sSS,therowkeyischosenas

τ

,

α

formodel Mi.Here,

α

isapostﬁxofrowkeytoindicatethat thisrowisthestartingpositionof

τ

’sSSinthetable.

•

UponupdatingtheSSofnode

τ

:whentheSSof

τ

hasalready beeninitialized,thetimeinterval

[

lt

,

rt

]

(resp.

[

lv

,

rv

]

forvalue interval)tobeindexedwillbeincorporatedintotherowkey, i.e.,

τ

,

lt

,

rt

(resp.

τ

,

lv

,

rv

intheindex-modeltablefor val-ues). In this way, different modeled segments stored in the sameSSofanodedonotoverwriteeachother.

(7)

Fig. 5.Modeled segment materialization of KVI-index.

The selection ofspeciﬁc

α

should make surethat the binary representation of

τ

,

α

is in front of any other

τ

,

lt

,

rt

. This designisusefulforqueryprocessing.Forinstance,ifthequery pro-cessorrequirestoaccessallthemodeledsegmentsstoredat regis-trationnode5,thenweknowthatallthecorrespondingmodeled segmentslieintherowswithintherow-keyrange

[

5

,

α

,

6

,

α

)

. Forexample,take themodel1andmodel2inFig. 5.First,the KVI-indexcheckswhetherthestartingmodeledsegmentsofnode5and

node15,namelyrowswithkey 5

,

α

and 15

,

α

,exist.Then,the rowkey 5

,

4

,

6

isconstructedformodel1astheSSofnode5has been initialized, whilst KVI-index constructs the rowkey 15

,

α

formodel2.

4.2.3. Complexityanalysis

In this section, we analyze the computational and communi-cationcomplexitiesofupdating operationofKVI-index.The com-plexity analysis of KVI-index includes in-memory and key-value part.

•

In-memory

The rSearch on vs-tree can run within O

(

log

(

R

))

time. The spaceexpansion costs O

(

1

)

time.Onlytherootvaluer of vs-tree iskept inmemory. The valuesofother nodes onvs-tree

canbe calculateddueto thecomputabilityofvs-tree’s space-partition.Therefore,thespacecomplexityofvs-treeis O

(

1

)

.

•

Inkey-valuestore

The update operation on vs-tree does not generate any net-work I/O cost. For an n-th order polynomial model of one sensordatasegment,

(

2

+

n

)

putoperationsareconductedto materializeonemodel.Therefore,thetimecomplexityis O

(

1

)

intermsofnetworkI/O, asn isconstant.Moreover, one seg-mentmodel’stime(orvalue) intervalandcoeﬃcientsareonly materializedonce intothe index-model table,thus the space costis O

(

N

)

fortime(orvalue)index.

5. QueryprocessingviaKVI-indexandMapReduce

Forquerying model-viewsensordata,thesearchingprocess of qualiﬁedmodeledsegments(deﬁnedinSection3)inKVI-index in-cludestwosteps:

•

Intersectionandpointsearch:Theintersectionsearchonvs-tree

isusedforrange queries,whilepoint search isemployed for point queries. They are responsible for collecting the nodes that accommodate qualiﬁed modeled segments in their sec-ondarystructuresSSs.

•

Modeled-segmentfiltering:Duetotheruleforinterval registra-tionatthenodesofthevs-tree,theSSofanodemaycontain some intervals irrelevant to queried range or point. In KVI-index,the SSs ofall nodesfoundby thesearch operationare accessedtofilteroutunqualifiedsegments.

Algorithm 3:iSearch+ofvs-tree.

Input: timequeryrange[lt,rt],rootvalue

r

Output: nodesetS0andD

1 begin

2 /*constructS0

3 node=r;

h

=log(r)−1;

4 while(h ≥0)do

5 if(lt≤node and rt≥node)then

6 break;/*nodeistheregistrationnode

7 else 8 S0=S0∪node 9 if(lt>node)then 10 node=node+2h_; 11 if(rt<node)then 12 node=node−2h_; 13 h ₌h ₋1; 14 /*constructD. 15 ls=2log(lt),

r

s=R −2log(R−rt),D= [ls,rs]

Afterthe above two steps, modelgridding component fetches the coefficientsof each qualifiedmodeled segment andperforms gridding. Next, we first describe an enhanced intersectionsearch algorithmonvs-treethatbenefitsKVI–Scan–MapReducequery pro-cessing,introducedlaterinthissection.Wethenpresentthepoint search algorithmonvs-tree.Subsequently,weintroduceournovel hybrid KVI–Scan–MapReduce queryprocessing. Last,we theoreti-cally analyzethe enhanced intersectionsearch algorithm of KVI-index.

5.1. Intersectionandpointsearch

5.1.1. Enhancedintervalintersectionsearch

Algorithm 3 presents the iSearch+. Given a time (resp. value) rangequery

[

lt

,

rt

]

,iSearch+ ﬁrstcallstherSearch toﬁndthe reg-istration node

τ

of

[

lt

,

rt

]

.The nodesonthe searchingpathfrom the rootnode to the one preceding

τ

form a node set denoted by

S

0.TheiSearch+stopsatthenode,whichisclosesttothe left-endpoint lt.Allthenodesalong theleft-descendingpath forma node set,denotedby

S

l,whilethenode withtheminimumvalue inthispathisdenotedbyls.Analogously,

S

r isthenode setfrom the right-descendingpath andrs isthe node withthemaximum value inthispath.Anynode outsidetherange

[

ls

,

rs

]

andtheset

S

0doesnothaveanyqualiﬁedmodeledsegments.

The traditionalintersection search would returnthe node set

C

=

S

0

∪

S

l

∪

S

r

∪ [

lt

,

rt

]

for further modeled-segment ﬁltering and gridding. OuriSearch+ outputs thediscrete node set

S

0 and a consecutive range of nodes

D

= [

ls

,

rs

]

. For example, take the range query in Fig. 6(a).node7 is the registration node of query range

[

6

,

10

]

.ThetraditionaliSearchreturnsthediscretenodesets shown inthe solid boxes ofFig. 6(a),while our iSearch+ returns a range of nodes

[

3

,

11

]

and

S

0

= {

15

}

. We will see how the output of iSearch+ beneﬁts the hybrid query processing later in Section5.2.

5.1.2. Pointsearch

We denotethe point search by sSearch as it functionsas the stabbing search of interval tree. The sSearch is a binary search thatrecordsthenodesalongthedescendingpath.Wepresentthe

sSearch inFig. 6(a).Forexample,whenqueryingthesensorvalue of time point 24,the node set

S

0

= {

15

,

7

,

11

,

9

,

10

}

is returned by sSearch. Since thereis no split searching,as in iSearch+, only one node set is produced here. We denote this node set by

S

0 aswell,soastofacilitatethedescriptionofthehybridKVI–Scan– MapReducequeryprocessingthatfollowsnext.

(8)

Fig. 6.Workﬂow of KVI–Scan–MapReduce approach. (a)iSearch+andsSearch. (b)SSlocation distribution for the range query. (c) Hybrid processing.

5.2.KVI–Scan–MapReducequeryprocessing

Theconventionalclusteredindexforone-dimensionaldatacan exactly locate a consecutive range of qualified data. Then, the query processor just needs to do range scan on these qualified data.However,asforquerying model-viewsensordata,the chal-lengeis howtotacklethelarge amountofsegmentmodelsfrom the SSs distributed across the index-model table. We first ana-lyze the location distribution of the SSs of the nodes found by

iSearch+ andsSearch intheindex-modeltable. Thecharacteristics ofthis distribution inspired us to propose the hybrid KVI–Scan– MapReducequeryprocessingapproach.

5.2.1. SSlocationdistribution

TherearetwocasesfortheSSdistributionintheindex-model table,describedbelow.

•

S

0:TheSSsof

S

0arenon-consecutiveandsparselydistributed in theindex-modeltable. Thenode valueisthe primarypart oftherow-key;thus,thedistancebetweenSSsof

S

0 depends on the numerical difference of node values. As

S

0 includes the nodes from root node to the one preceding the

τ

, the intra-distancesbetweenanyconsecutivenodesin

S

0 are2h−i, where i

=

0

,

. . . ,

h

−

dτ is the position of the node in the

descending search path

S

0 and dτ is the depth of

τ

.

Obvi-ously,theintra-distancesin

S

0aregreaterthanthoseforother nodesbelow

τ

inthesearchpath.

•

D

:TheSSsof

[

ls

,

rs

]

areclusteredaroundtheonesof

[

lt

,

rt

]

in theindex-modeltable.TheSSsof

[

lt

,

rt

]

arealladjacentinthe index-model table.TheSSs of

[

ls

,

rs

]

areboundedbythoseof thesub-treerootedat

τ

andthenodesin

[

ls

,

rs

]

areasuperset of thenodes in

[

lt

,

rt

]

.The deeper theregistrationnode

τ

is located, thetighter the setofthe SSsof

[

ls

,

rs

]

over thoseof

[

lt

,

rt

]

.

For example, take the time (or value) query range

[

6

,

10

]

in

Fig. 6(a). The registration node is node7. Then,

S

0

= {

15

}

and

D

= [

3

,

11

]

. The sub-treerooted at node7 coversthe node range

E

= [

0

,

14

]

and

D

⊂

E

.From Fig. 6(b),theSSsof

D

are clustered around those of

[

6

,

10

]

and bounded by the SSs of

E

. However,

node15’sSSislocatedfarawayfromthoseof

[

3

,

11

]

.

IfSSs of

S

0 and

D

are processed via straightforward random access and range scan provided by key-value stores, the entire modeled-segment ﬁltering and gridding processes are conducted locallyattheapplicationside.Foratableofmultipleorhundreds

ofGBs,thecommunicationandcomputationcostsareprohibitively highfortheapplicationsideevenforlow-selectivequeries.

The modeled segment ﬁltering–gridding processing matches MapReduce’s ﬁltering-aggregation paradigm. Considering the re-search results from[31–33], for CPUnon-intensive workload,I/O cost, network latency and starting-up overhead of mappers are dominant in the execution time of MapReduce programs. If the

SSs of

S

0 and

D

are all processed by MapReduce, a lotof time is wasted for mappers that process irrelevant SSs in the index-modeltable.ThisisbecauseMapReducewillaccessthecontinuous regions of the table including the SSs of nodes between the

S

0 and

D

duetothesequentialdatafeedingmechanisminthe map-ping phase. For example,in Fig. 6(b), theSSs of

D

= [

3

,

11

]

and

S

0

= {

15

}

are distantinthe table.Hence, MapReducewill launch manyunnecessarymappersfortheirrelevantSSsofnodesbetween 11 and15,inordertoprocesstheSSsof

S

0and

D

.

5.2.2. Hybridmodelﬁlteringandgridding

As discussed above, simply using range scan or MapReduce to process SSs are both problematic. Ouridea isto design a hy-bridKVI–Scan–MapReduceparadigmthatcombinesrangescanand MapReduceforprocessingSSs,asfollows:

•

(1)

S

0: the height ofvs-tree is bounded by log

(

R

)

,andthus theamountofcomputationon

S

0 islimited.AstheSSs of

S

0 aresparselydistributedintheindex-model tableandeachSS

of

S

0canbeconsideredtobeasmallrangeofclusteredindex, therandom-access- andrange-scan-basedmodelﬁlteringand griddingissuitable.

•

(2) D

= [

ls

,

rs

]

: the successive range

[

ls

,

rs

]

delimits a tight boundary ofthesub-index-model table overthe relevantSSs thataresuitableforprocessingwithMapReduce.

This hybridparadigm eliminates the Map-phaseprocessing of

SSsofirrelevantnodesbetween

S

0and

D

andthenodesbetween theelementsof

S

0.Moreover,itisnon-intrusiveforboththe key-value store andMapReduce. Regarding the time (or value)point query, it only produces the node set

S

0 without

D

, hence, only range-scan-basedmodelﬁlteringandgriddingisneeded.

Supposethatthenumberofreducersis P andeachreducer is denotedby 0

,

. . . ,

P

−

1.Forrangequeries,the partitionfunction

f isusedtoassignthe qualiﬁedmodeledsegments intodifferent reducers. It is designed on the basis of query time (resp. value) range

[

lt

,

rt

]

(resp.

[

lv

,

rv

]

) andthetime (or value)interval

[

li

,

ri

]

ofeachmodeled segmenti.The ideaisthateach ofthereducers

(9)

isincharge ofoneevensub-range rt−lt

P .Suchapartitionfunction f isgiveninEq.(1). f

(

ri

)

=

lt

≤

ri

≤

rt

₍ri−lt)∗P rt−lt

ri

≥

rt P

−

1 (1) Thefunctionalitiesofmappersandreducersaredepictedin de-tailbelow.

•

Mapper: Each mapper gets the time (resp. value) interval

[

li

,

ri

]

of one modeled segment i to check whether it inter-sectswiththequerytime(resp.value)range.Forthequaliﬁed modeledsegments,theintermediatekeyisderivedbythe par-tition function f

(

ri

)

. The model coeﬃcients p1i

,

. . . ,

pni

are thevaluepartoftheintermediatekey-valuepair.

•

Reducer: One reducer receives a list of qualiﬁed modeled

segments p1

0

,

. . . ,

pn0

,

p11

,

. . . ,

pn1

,

. . .

.Foreachmodeled seg-ment p1

i

,

. . . ,

p n

i

,thereducerinvokesamodel-basedgridding functiontocomputediscretevaluesasqueryresults.

Regarding the scan-based modelﬁltering and gridding, asSSs in

S

0arelocatedindifferentregionsoftheindex-modeltable,the query processor makes use of thread pool to process each SS of

S

0 inparallel.Fig. 6showstheworkﬂow ofthehybridKVI–Scan– MapReduce approach. For a time (or value) range query

[

6

,

10

]

,

iSearch+constructsthenode set

S

0

= {

15

}

and

D

= [

3

,

11

]

.Then, theSSsofthenodesin

D

aresenttoMapReduce.TheSSofnode15, enclosed by the bottomdot-dashed block, is processed viarange scan.

5.3. Theoreticalanalysis

One point to carefullyconsider isthat iSearch+ may generate redundant nodes, because the iSearch+ aims to ﬁnd a tight and consecutiverangeofSSsforMapReduce.Forinstance,inFig. 6(a), theSSofnode4 isnotaccessedbytheiSearch+,butisinthe sub-tableprocessedbyMapReduce.

Theorem1.Forarangequery

[

lt

,

rt

]

,theredundantnodesin

[

ls

,

rs

]

returnedbyiSearch+arebounded.

Proof. Assumehtobetheheightoftheregistrationnode

τ

.

Con-sidertheleft-descendingpathfrom

τ

tothenode l0 closest tolt. Letd be the depth from which the descending path turns right, namelythevalue w

<

lt ofthe currentnode.Then, basedonthe iSearch+algorithm, wistheleftboundaryofaccessednoderange andw

=

τ

−

d

i=12h−i.

Fora certainvalueofd,theworstcasehappenswhenthe de-scending process continues to go right until reaching l0, as the nodesbetween w andl0 are all redundantones. The number of nodesreturnedbyiSearch+underthiscaseisgivenby:

U

=

τ

−

τ

−

d

i=1 2h−i

(2) The nodesbetween

τ

and w are allincluded into theoutput range

D

of iSearch+.The numberof nodesreturned by the con-ventionaliSearchisgivenby:

V

=

h

−

d

+

τ

−

τ

−

d

i=1 2h−i

+

h

i=d+1 2h−i (3)

Therefore,thenumberofredundantnodesreturnedbyiSearch+

isgivenby: f

=

U

−

V

=

d

+

h

i=d+1 2h−i

−

h

=

d

+

2h−d

−

h

−

1 (4)

Eq. (4) is a function of d and is monotonous decreasing in d’s domain

[

1

,

h

]

. Consequently,when d

=

1, the function f reaches themaximumvalue,namely,thenumberofredundantnodesfrom

iSearch+attainsthemaximumvalue fmax thatisgivenby:

fmax

=

2h−1

−

h

.

(5) Asthetotalnumber

η

ofnodesofthesub-treeoftheleftchild of

τ

is2h

−

1,hence

f

≤

1

2

η

−

log

(

η

+

1

)

+

1

2

.

(6)

Insummary,thetotalnumberofredundantnodesintherange

[

ls

,

rs

]

isbounded.

2

The worst case happens when the endpoints lt and rt are the preceding andsucceeding nodesof

τ

,namelylt

=

τ

−

1 and rt

=

τ

+

1.However, formostof thecases, the redundantnodes returnedfromiSearch+areverylimited.

6. Experimentalevaluation

First, we compare model-view sensor data query processing with conventional one over raw sensor data. Then, we show that ourKVI–Scan–MapReduce(KSM) approachoutperformsother model-view sensor data querying approaches. Finally, we experi-mentally explore the factors that affect theperformance ofKVI– Scan–MapReduce.

6.1. Setup

We employ accelerometer datafrommobile phones assensor dataset.Thesizeofrawsensordatais22GBincluding200million data points. Aftermodeling,themodeled segments ofthe sensor datatake 12GB,whilethereare around25million modeled seg-ments.

We developed our system using the versions of HBase and HadoopinClouderaCDH4.3.0.The experimentsareperformedon our own cluster that consists of 1 master node and 8 slaves. The master node has 64 GB RAM, 3 TB disk space (4

×

1 TB disks in RAID5) and12 cores, each of which is 2.30 GHz (Intel XeonE5-2630). Eachslavenode has6cores2.30GHz(IntelXeon E5-2630),32 GBRAMand6 TB diskspace(3

×

2 TB disks).Nodes are connectedvia 1 GB Ethernet. Inthe experimentalresults, we refer to query selectivity asthe ratio ofthe number of qualiﬁed modeledsegmentsoverthatoftotalmodeledsegments.

The datasetcontains discreteaccelerometer datafrommobile phonesandisasequenceoftupleseachofwhichhasone times-tamp and three sensor values representing the coordinates. The size of the raw sensor data set is 22 GB including 200 million data points. We simulate the sensor data emission, in order to segmentandupdatesensordataintotheKVM-indexinanonline manner. Weimplementanonlinesensordatasegmentation com-ponent [5] applying the PCA (piecewise constant approximation)

[1],whichapproximates onesegment witha constantvalue (e.g., themeanvalueofthesegment).Sincehowtosegmentandmodel sensordataisnotthefocusofthispaper,othersensortimeseries segmentation approachescouldalsohavebeen appliedhere. Pro-videdthat the segments arecharacterized by thetime andvalue intervals, ourKVM-indexandrelatedquery-processingtechniques are able to manage them eﬃciently in the cloud. Finally, there

(10)

Fig. 7.Sensor data updating performance.

are around 25 million sensordata segments (nearly 15 GB) up-loaded intothe key-value store. Regarding the segment gridding, wechoose1 secondasthetimegranularitywhichis application-speciﬁc.

6.2.Indexupdating

Fig. 7showstheaverageupdatingtimeofeachsegmentandthe average insertion time of each raw sensor data point during the datauploading phase. Both time and value index keep relatively stableupdating eﬃciency.The updating ofthevalue KVI-indexis

alittleslowerthanthetimeKVI-index.As discussedbefore,since the domain ofthe value vs-tree is smallerthan that ofthe time

vs-tree,thevalueindexperformsmoreSSupdatingoperations (dis-cussed in Section 4.2) than the time index and therefore incurs morenetworkI/Ocost.Therawsensordatainsertionisthefastest but the amount of data to update is much larger than model-view approach.This is because model-view sensor data achieves datacompressionovertherawsensordatatherebydecreasingthe amountofdatatoupload.

6.3. Model-viewsensordatavs.rawsensordata

We create two tables, which take the time-stamp and sensor value as the row-keysrespectively, such that the queryrange or pointcanbeusedaskeystolocatethequaliﬁeddatapoints.Then, the query processor invokes the MapReduce to access the large sizeofdatapointsforgettingqueryresults.

Figs. 8(a),(b)and(c)presentthequeryresponsetimesfortime range, value range and point queries respectively. As shown in

Figs. 8(a)and(b),themodel-viewapproachtakesaround30% less time than the raw sensor data method for both time and value rangequeries.Althoughtherawsensordatabasedmethodsapply MapReducetodirectlyaccessthequaliﬁedtuplesviatherow-key based range scan, the amount of raw sensor data to process is much larger than that of the model-view approach. In Fig. 8(c), theprocessingtimeoftherawdatabasedmethodis2

×

lessthan thatofthemodel-viewoneintimepointqueries,becausetheraw datamethodcanusethequerytimepointasindexkeytodirectly accesstherelevantdatapoints,whileourKSMrequirestoperform model ﬁlteringandgridding. For value point queries,the

(11)

Fig. 9.Query performance comparison of different model-view approaches. (a)–(c): time range queries, (d)–(g): value range queries.

viewapproachhasnearly3

×

lesstimethantherawdatamethod. As,normally,thereisalarge sizeofdatapointswiththequeried value,MapReduce isusedtoaccessthisqualiﬁedsensordataset. Inthemodel-viewapproach,thepointqueryprocessingonlyuses randomaccessandrangescan togetqualiﬁedmodeled segments forgriddinglocally,andthusitsavesthetimeonstarting MapRe-ducetoaccessdata.

6.4. Comparisonofmodel-viewapproaches

There are four baseline approaches for querying model-view sensordata,namely:

MapReduce (MR). This approach utilizes MapReduce without supportfromanyindex.Italwaysworksonthewholeindex-model table to ﬁlter the qualiﬁed modeled segments in the mapping phaseandperformthemodelgriddinginthereducephase.

Intervaltree(IT).Weimplementedthetraditionalquery process-ingoperationsoftheintervaltree[23,26]byaddinganothertable tostoretheSSofeachnodesortedbytherightend-pointof inter-vals.Eachindex, time orvalue, hastwoassociated tables.During the intersection or point search on vs-tree, the query processor decides whichtable to accessbased onthe relationbetweenthe queryrange(orpoint)andthenodevalue.Inthisway,thequery processor can stop scanning once it encounters one unqualified modeledsegment,duetothemonotonicityofend-pointsof mod-eledsegments. IT makes useofrandomaccessandrangescan to sequentiallyfilterthequalifiedmodeledsegmentsandmake grid-dinglocally.

MapReduce+KVI(MRK).TheideaofMRKistoleverageKVI-index to avoid having MapReduce to process the whole table. In MRK, MapReduce is designed to work over one continuous sub-index-modeltable includingallthe SSs oftheaccessed nodesinsearch

operations. Forinstance, in Fig. 6(a), fora time (or value) range query

[

6

,

10

]

,MRK invokesMapReduce to work on thesub-table within the row-keyrange

[

3

,

α

,

16

,

α

]

.The same idea applies forpointqueries. Ascomparedto ourhybridKSMapproach,MRK

isalightweightindexing-MapReduceplan,asitprocessesmany ir-relevantSSsofnodesbetween

S

0 and

D

.

Filter ofkey-value store(FKV). Some key-value stores such as HBaseprovideafilterfunctionalitytosupportpredicate-basedrow selection [20]. The filtertransmits the filtering predicate toeach region server and then all servers scan their local datain paral-lel. Afterwards, they return the qualified tuples. Our filter-based queryprocessingalsoworkson theindex-modeltable, asthe fil-teringpredicatescanbedirectlyappliedtothecolumns.Thequery processorwaitsuntilallregionserversfinishscansandthenit re-trieveseach returnedqualifiedmodeledsegment toconduct grid-dinglocally.

6.4.1. Rangequery

Figs. 9(a), (b) and(c) present the performance of time range queries. As depicted in Fig. 9(a), KSM outperforms MRup to 3

×

for thelow-selective time range queries.As the query selectivity increases, the amountof SSs forscan basedprocessingdecreases andthatforMapReduceapproachestheentiretable.Therefore,the responsetime increasesandapproachesthatofMR.The response timeofMRincreaseslittle.Asincreasingqueryselectivityleadsto ascending gridding workloadin reduce phase,theseresultsshow that the overhead from model gridding is not dominant in MR. TheresponsetimeofMRKismorethanthatofKSM,butlessthan thatofMR.AsMRK utilizestheKVI-indextolocalizeaconsecutive sub-index-modeltable covering allthe SSsof nodesfoundby in-tersectionsearch,itprocessesfewermodeled segmentsthanMR’s fulltablescanning.Yet,ascomparedtoKSM,MRKprocessesmore

(12)

redundantmodeledsegments. Moreover,asthe sub-tableinMRK

covers a large range,the processing time of MRK increaseslittle forlow-selectivequeries.

Fig. 9(b)exhibitstheperformanceofITandFKV approaches.As

FKV needs to wait for each region server of HBase to ﬁnish the localdatascanning, its totalresponse time isa littlelonger than thatofIT approach.Theybothconsumemuchmoretimethanall

MR,MRK andKSM,astheyapply sequentialaccessingofmodeled segments.

Wealsoanalyze thenumberofmodeledsegmentsaccessedby each approach in Fig. 9(c). These experiments show how differ-entaccessmethodsofmodeled segmentsaffectthe performance.

MRworksonthe entiretable,thus, thenumberofaccessed seg-ments is the same. From the point of view of the application, only qualiﬁed modeled segments are returned forgridding, thus

FKV processesnoredundantmodeledsegmentsandconsumesthe leastamount ofmodeled segments. Since IT scansthe SSof one node until encountering an unqualiﬁed model, the total number ofaccessedsegments is alittle largerthan that ofFKV.OurKSM

processes larger number ofsegments than both IT and FKV due tothecontinuous andredundantrangeof SSs found by iSearch+. However, the results also verify our theoretical analysis that the amountofredundantmodeledsegmentsisbounded.MRKaccesses moresegmentsthanIT,FKV andKSM,asitaddstheSSsbetween

S

0 and

D

to forma continuous sub-tableforMapReduce. Refer-ring to Fig. 9(a) andFig. 9(b),although KSM approach consumes moresegments thanIT andFKV,its hybridparadigm isthemost eﬃcient.

Figs. 9(d), (e) and (f) present the value range query perfor-mance. Thedifferent queryprocessingapproaches exhibit similar patterns as for the time range queries, so we skip the detailed analysis.

6.4.2. Pointquery

The time and value point query processing performance are showninFig. 10.IT wins bothfor time andvalue point queries. The response time of KSM is a little greater than IT, but out-performs the other approaches, because IT is able to access all qualified modeled segments in one SS. However, the KSM scans thewholeSSentriesofanodetofindthequalifiedones.Because oftheinvocationofMapReduceandredundantmodeledsegments inthesub-table,MRK takesmoretimethanbothITandKSM.But, asMRK does not work on the entiretable as MR does, it takes about2

×

lesstime thanMR. FKV consumes themost time asit needs to wait for server-sidefull table scan before gridding op-erations. Since the size ofthe domain of the sensor data values is smaller than that of the time domain, nodes in value vs-tree

accommodate more modeled segments than time vs-tree nodes. Thus,theresponsetimesofvaluepointqueriesofIT,FKVandKSM

approachesare allmorethanthoseoftimepoint queries.ForMR

andMRK,theprocessingtime differencesbetweentimeandvalue queriesareinsigniﬁcant,asthetime spentonmodelﬁlteringand griddingisnotdominant.

6.5.InsightsintoKVI–Scan–MapReduce

Section 6.5.1 discusses effect of the searching depth on the performances of in-memory iSearch+, sequential scan based and MapReduce based model ﬁltering and gridding. At last, we will see the workload constitutions within the KVI–Scan–MapReduce paradigminSection6.5.2.

6.5.1. Searchingdepth

InFigs. 11, 12 and 13, we presentthe results fromtime and value range queries of 10% selectivity and 10% to 50% iSearch+

depth.The percentageof iSearch+ depth heremeansthe ratioof

Fig. 10.Query performance of point queries.

Fig. 11.Effect oniSearch+.

Fig. 12.Effect on sequential scan.

(13)

Fig. 14.Query processing time constitution of time range queries.

Fig. 15.Query processing time constitution of value range queries.

registration node searching depth over the height of vs-tree in

iSearch+.AsiSearch+depthincreases,thenumberofSSsforrange scan based modelprocessing increases. Therefore, the time con-sumed by iSearch+ andrange scan based model processingboth increases, shown inFigs. 11 and 12. As forthe MapReduce part, thedeeperthelevelofregistrationnode

τ

is,thesmallerthespace coveredbythesub-treerootedat

τ

is.ThentherangeofSSsin

D

istighterover therangeofSSsofqueryrangeandresultsinless redundantSSs for MapReduce, which is theoretically analyzed in Section5.3.FromFig. 13,wecanseeasalientdecreasingtrendof MapReduceprocessingtime.

6.5.2. Workloadconstitution

Thisexperimentaimstorevealhowmuchtime theKVI–Scan– MapReduce spends on model gridding, which is a difference of model-view approach from raw sensor data approach. Figs. 14 and 15showtheresultsfromtimeandvalue rangequeriesof se-lectivityfrom10% to50%.

From Figs. 14 and 15, the time on model gridding accounts for 1

/

3–1

/

2 of the total query processing time. As the majority of gridding work is done in the reduce phase and the amount of qualiﬁed segment models sent to reducers depends on the query selectivity, the time spent on modelgridding increases as thequeryselectivityincreases.Ifthemodelgriddingcan adaptto users’differentrequirementsforqueryresults,theperformanceof theKVI–Scan–MapReduceschemecanbefurtheroptimized.

7. Conclusion

To the best of our knowledge, this is the ﬁrst work to ex-plorethekey-valuerepresentationofan intervalindexfor model-viewbased sensordatamanagement.Differentfromconventional

external-memoryindexstructurewithcomplexnodemergingand splitmechanisms,ourKVI-index,residentpartiallyinmemoryand partially materialized in the key-value store, is easy to maintain inthedynamicsensordatagenerationenvironment.Moreover,we proposed a hybrid queryprocessing approach, namelyKVI–Scan– MapReduce,integratingtheKVI-index,rangescan andMapReduce formodel-view sensordatain key-valuestores. Extensive experi-ments inarealtestbedshowedthat ourapproachoutperformsin terms of query response time and index updating eﬃciency not onlyqueryprocessingmethodsbasedonrawsensordata,butalso all otherapproachesconsidered basedonmodel-viewsensordata fortime/valuerangeandpointqueries.Asafuturework,weplan to explore howto processtime andvalue compositequeries and joinqueriesbasedontheKVI-index.

Acknowledgement

This work is supported by the European Commission (EC), ProjectFP7-ICT-2011-7-287305OpenIoT.

References

[1]S. Sathe, T.G. Papaioannou, H. Jeung, K. Aberer, A survey of model-based sen-sor data acquisition and management, in: Managing and Mining Sensen-sor Data, Springer, 2013.

[2]A. Thiagarajan, S. Madden, Querying continuous functions in a database system, in: SIGMOD, 2008.

[3]A. Deshpande, S. Madden, MauveDB: supporting model-based user views in database systems, in: SIGMOD, 2006.

[4]T. Papaioannou, M. Riahi, K. Aberer, Towards online multi-model approximation of time series, in: MDM, 2011.

[5]T. Guo, Z. Yan, K. Aberer, An adaptive approach for online segmentation of multi-dimensional mobile data, in: Proc. of MobiDE, SIGMOD Workshop, 2012. [6]H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, E. Keogh, Querying and mining

of time series data: experimental comparison of representations and distance measures, VLDB Endow. 1 (2008) 1542–1552.

[7]E. Keogh, S. Chu, D. Hart, M. Pazzani, An online algorithm for segmenting time series, in: Proceedings of the IEEE International Conference on Data Mining, ICDM 2001, IEEE, 2001, pp. 289–296.

[8]Y. Cai, R. Ng, Indexing spatio-temporal trajectories with Chebyshev polyno-mials, in: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, ACM, 2004, pp. 599–610.

[9]Q. Chen, L. Chen, X. Lian, Y. Liu, J.X. Yu, Indexable PLA for eﬃcient similarity search, in: Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB Endowment, 2007, pp. 435–446.

[10]Saket Sathe, Thanasis G. Papaioannou, Hoyoung Jeung, Karl Aberer, A survey of model-based sensor data acquisition and management, in: C.C. Aggarwal (Ed.), Managing and Mining Sensor Data, Springer US, 2013.

[11]A. Bhattacharya, A. Meka, A. Singh, Mist: distributed indexing and querying in sensor networks using statistical models, in: VLDB, 2007.

[12]A. Deshpande, C. Guestrin, S.R. Madden, J.M. Hellerstein, W. Hong, Model-driven data acquisition in sensor networks, in: Proceedings of the Thirtieth In-ternational Conference on Very Large Data Bases, in: VLDB Endowment, vol. 30, 2004, pp. 588–599.

[13]E. Keogh, K. Chakrabarti, M. Pazzani, S. Mehrotra, Locally adaptive dimension-ality reduction for indexing large time series databases, SIGMOD Rec. 30 (2) (2001) 151–162.

[14]J. Shieh, E. Keogh, iSAX: indexing and mining terabyte sized time series, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 623–631.

[15] OpenTSDB,http://opentsdb.net/,2011.

[16]T. Guo, T.G. Papaioannou, K. Aberer, Model-view sensor data management in the cloud, in: 2013 IEEE International Conference on Big Data, IEEE, 2013, pp. 282–290.

[17]T. Guo, T.G. Papaioannou, H. Zhuang, K. Aberer, Online indexing and distributed querying model-view sensor data in the cloud, in: The 19th International Con-ference on Database Systems for Advanced Applications, 2014.

[18]F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chan-dra, A. Fikes, R.E. Gruber, Bigtable: a distributed storage system for structured data, ACM TOCS 26 (2008) 4.

[19]J. Dean, S. Ghemawat, MapReduce: simpliﬁed data processing on large clusters, Commun. ACM 51 (2008) 107–113.

[20]L. George, HBase: The Deﬁnitive Guide, O’Reilly Media, Inc., 2011.

[21]C.P. Kolovson, M. Stonebraker, Segment indexes: dynamic indexing techniques for multi-dimensional interval data, SIGMOD Rec. 20 (2) (1991).

(14)

[22]G. Sfakianakis, I. Patlakas, N. Ntarmos, P. Triantaﬁllou, Interval indexing and querying on key-value cloud stores, in: ICDE, 2013.

[23]L. Arge, J. Vitter, Optimal dynamic interval management in external memory, in: 37th Annual Symposium on Foundations of Computer Science, 1996. [24]R. Elmasri, G.T.J. Wuu, Y.-J. Kim, The time index: an access structure for

tem-poral data, in: VLDB, 1990.

[25]T. Bozkaya, Z.M. Özsoyoglu, Indexing valid time intervals, in: DEXA, 1998. [26]H.-P. Kriegel, M. Pötke, T. Seidl, Managing intervals eﬃciently in

object-relational databases, in: VLDB, 2000.

[27]H.-P. Kriegel, M. Pötke, T. Seidl, Object-relational indexing for general interval relationships, in: Advances in Spatial and Temporal Databases, Springer, 2001. [28]S. Nishimura, S. Das, D. Agrawal, A.E. Abbadi, MD-HBase: a scalable

multi-dimensional data infrastructure for location aware services, in: MDM, 2011.

[29]J. Wang, S. Wu, H. Gao, J. Li, B.C. Ooi, Indexing multi-dimensional data in a cloud system, in: SIGMOD, 2010.

[30]M.-Y. Iu, W. Zwaenepoel, HadoopToSQL: a MapReduce query optimizer, in: Eu-roSys, 2010.

[31]J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, J. Schad, Hadoop++: making a yellow elephant run like a cheetah (without it even noticing), VLDB Endow. 3 (2010) 515–529.

[32]J. Dittrich, J.-A. Quiané-Ruiz, S. Richter, S. Schuh, A. Jindal, J. Schad, Only ag-gressive elephants are fast elephants, VLDB Endow. 5 (2012) 1591–1602. [33]M.Y. Eltabakh, F. Özcan, Y. Sismanis, P.J. Haas, H. Pirahesh, J. Vondrak,

Eagle-eyed elephant: split-oriented indexing in Hadoop, in: EDBT, 2013.

[34]H.-C. Yang, D.S. Parker, Traverse: simpliﬁed indexing on large map-reduce-merge clusters, in: DASFAA, 2009.

Big Data Research 1 (2014) 52–65

ScienceDirect

www.elsevier.com/locate/bdr

(http://hbase.apache.org/

(http://hive.apache.org/

S. Sathe, T.G. Papaioannou, H. Jeung, K. Aberer, A survey of model-based sen-sor data acquisition and management, in: Managing and Mining Sensen-sor Data,

A. Thiagarajan, S. Madden, Querying continuous functions in a database system,in: SIGMOD, 2008.

A. Deshpande, S. Madden, MauveDB: supporting model-based user views in

T. Papaioannou, M. Riahi, K. Aberer, Towards online multi-model approximationof time series, in: MDM, 2011.

T. Guo, Z. Yan, K. Aberer,

H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, E. Keogh, Querying and mining of time series data: experimental comparison of representations and distanc