T-Hoarder: a framework to process Twitter data streams

(1)

This is a postprint version of the following published document:

Mariluz Congosto, Pablo Basanta-Val, Luis Sánchez-Fernández,

T-Hoarder: A framework to process Twitter data streams, In Journal of

Network and Computer Applications (2017), vol. 83, pp. 28-39

https://doi.org/10.1016/j.jnca.2017.01.029

© 2017 Elsevier Ltd. All rights reserved.

This work is licensed under a

Creative Commons

Attribution-NonCommercial-NoDerivatives 4.0 International License

.

(2)

T-Hoarder:

A

framework

to

process

Twitter

data

streams

Mariluz

Congosto,

Pablo

Basanta-Val,

Luis

Sánchez-Fern

á

ndez

With the eruption of online social networks, like Twitter and Facebook, a series of new APIs have appeared to allow access to the data that these new sources of information accumulate. One of most popular online social networks is the micro blogging site Twitter. Its APIs allow many machines to access the torrent simultaneously to Twitter data, listening to tweets and accessing other useful information such as user profiles. A number of tools have appeared for processing Twitter data with different algorithms and for different purposes. In this paper T Hoarder is described: a framework that enables tweet crawling, datafiltering, and which is also able to display summarized and analytical information about the Twitter activity with respect to a certain topic or event in a web page. This information is updated on a daily basis. The tool has been validated with real use cases that allow making a series of analysis on the performance one may expect from this type of infrastructure.

1. Introduction

Recently, thegeneralizationof microblogging (Javaetal.,2007; HoneyandHerring,2009)byawidesectorofsocietyhascontributedto transformthewayinwhichpeopleconsumeinformation,spreadit,and interactwith others.Theavailabilityof mechanismstopublishshort messages, pictures,and videos, togetherwith theiruse frommobile phones,allowsthatinformationtoﬂowinrealtimethroughanetwork of users.Thischange alsotransformsthewaydataisprocessed and integrated in diﬀerentbusiness models which take into account the informationstoredinmicrobloggingrecordstodeterminethenextset ofdecisionstobetakeninabusinessinfrastructure(Gudivadaetal., 2015;Manyikaetal.,2011;ZikopoulosandEaton,2011).

Twitter is themost popular micro blogging network.2 It ischar acterizedbythelimitinthelengthofitsmessages,calledtweets(140 characters)andbytheasymmetricrelationsbetweentheirusers.Atthe moment,500milliontweetsarepublisheddaily,whichmeansahuge sourceofsocialdatathathascaughtattentionfromresearchers.From 2008,itisalsoasourceofstudybyacademicresearchers(Hubermanet al.,2008)andithasbeenappliedinanumberofsocialﬁeldssuchas elections(Conoveretal.,2010;GayoAvello,2011;BarberáandRivero,

2012),socialmovements(Peñalópezetal.,2014),predictions(Bollen etal.,2011a;AsurandHuberman,2010),user'sinﬂuence(Chaetal., 2010),behavior(Bollenetal.,2011b;Doddsetal.,2011),andmessage propagationanalytics(DeDomenicoetal.,2013).

Currently, it is possible to accede to the complete volume of information on Twitter bymeans of payment through GNIP.3_Also, researchersmayaccessapartialsetoftweetsandcollectthedataby meansofspeciﬁcTwitterAPIs.Sinceitsinception,TwitterAPIscanbe used without any speciﬁc restrictions. This favored the creation of servicesthatcollecteddataliketweetbackup.comorTwapperKeeper.4

However this initial configuration changed a few years ago (Septemberof2012),withanewruleofusefortheAPIsthatprevented userssharingtweets.5_Currently,_the_rules_of_Twitter's_API6_only_allow sharingdatasetsofalimitedsize.Inthiscontext,sometoolsemerged toprocesstweetsdirectlyfromTwitter.Currently,thesetoftoolsthat canbeusedtoaccesstweetsincludesTwapperKeeper(OBrien,2011), TwitterTap(Kranjc,2014)andTwitterstreamtoMongodb(DelFresno, 2012).However,theirspecificconstraintsbringinnewconcernsandas aresultmoreprocessingcapacityisrequired.Inaddition,thespecific implementationdecisionshamperthetypeof analysisonemaycarry outwiththistypeoftools.

In a reaction to produce a more open, cost eﬀective, andﬂexible

2_{〈https://about.twitter.com/company〉.} 3_{〈https://gnip.com/〉.}

4_{〈https://twapperkeeper.wordpress.com/2011/02/22/removal-of-export-and-download-api-capabilities/〉.} 5_{〈https://dev.twitter.com/archive/terms/api-terms/di}_ﬀ_{-20130702〉.}

(3)

•

TwitterAPIV0.0:from2006toJuneof2010.

•

TwitterAPIV1.0: fromJuneof2010toJuneof 2013.Thebasic authentication,basedonuser/password,inRESTAPI,disappeared anditisreplacedbyOAuth(Leiba,2012).

•

TwitterAPIV1.1:fromJuneof2013todate.Itintegratesthesearch APIintheRestAPI,extendingtheauthenticationbyOAuthtothe StreamingAPIandmodifyingthespeedlimits. Atthemoment of writingthisarticle,TwitterprovidestwoAPIstoretrievedata:the RESTAPIandtheStreamingAPI.

1.RESTAPI:itallowsalltypeofqueriestothedataofTwitterina synchronousform.

2.Streaming API: it establishes a socket between Twitter and a serverbywhichinformationisasynchronouslyreceived. T Hoarder uses the Tweepy library (tweepy.org) that resolves low level access to the API. This library is widely used and updated to the latest API version. However, the source code is available in the Github repository. On the other hand, so far Twitter has announced new

versions with almost a year in advance allowing to face the changes. In the last three years the APIs changes have been limited to the JSON structure of the tweets.

In both APIs it is necessary to consider the restrictions that Twitter imposes in its use:

1.Speed limit: It restricts the number of queries to the REST API for

ﬁfteen minutes. The limit depends on the type of query. In the case of the Streaming API, the tweets per second (TPS) are limited to 50. 2.Time recover limit: It is possible to recover tweets from previous seven days with/search/tweetsGET method of the REST API. The last 3200 tweets of a user with method GET /statuses/ user timelineof REST API and no previous tweets are available with Streaming API.

3.Tweets sample: Only the /statuses/ user timeline GET method of the REST API provides a complete sample of the last 3200 messages of a user. In the/search/tweetsmethod GET of the REST API and the API streaming it is provided only a reduced percentage of tweets (85 95%), even when the frequency is less than 50 tweets per second.

2.2. Secure access

In order to access an API of Twitter, it is necessary to authenticate via the OAuth protocol. In practical terms, it is required to create an application in Twitter to obtain the keys of access for that application (Consumer KeyandConsumer Secret). Also, it is necessary to have a user in Twitter to generate the access key, calledAccess token, for that application. This way, Twitter knows what user accesses and from which application.

Within the options of the Streaming API, it is possible to obtain tweets by means of randomﬁltering, and also to the total stream of tweets. This option is only available for resellers of data; however, since its inception, it has been available for some researchers.

Currently, it is possible to select a set of tweets by one of these three forms:

1.Keywords: it allows obtaining the tweets that adjust to a pattern search providing a list of words, and comma separated expressions. The expressions are words separated by spaces that are in a message, with independence of the order in which they appear, but fulﬁlling the condition. The commas that separate the words or expressions act like OR logic operators and in an expression the spaces that separate the words behave like AND logical connectors. For instance, the search“#metromadrid, Madrid metro”would provide tweets that contain#metromadridOR(metroANDMadrid)

2.By users:it allows retrieving tweets of the users provided by means of a list with its separated identiﬁers by commas. Also, all the reactions to these messages are obtained including automatic or manual broadcastings (RTs), mentions, appointments, and answers. 3.By geo localization:In this case, application receives tweets pub

lished from one or several rectangular geographic zones, each of them delimited by a superior east coordinate and another inferior west coordinate.

Table 1

Comparison among the main methods to access Tweets.

API Recover Search type Delivered

Tweets

Streaming None Keywords, users, or locations If speed < 50 tweets per second, it recovers 80 90% of tweets

REST GET search 1 week back Keywords, users, or locations Unknown, because it only delivers the most relevant tweets. It is always less information than Streaming API.

REST GET statuses 3200 Last tweets Tweets that belong to one user All approach to Twitter data stream processing, THoarder has been proposed. It is a minimal framework that stores data into folders, providing alightweight approach toprocesstweets.Additionally, T Hoarderincludesanenginethatcarriesoutdataanalytic processing anddisplaystheminthreeaxes:time,space,andrelevance.THoarder isdesignedtodealwithamediumtolargenumberandsizeofdatasets, togatherdataforlongperiodsoftime(years),tobeabletoanalyze propagations,andtoidentifytheoriginofdata.Itsarchitectureisalso ofinterestbecauseitsunderlyingideascanbeusedinotheravailable engines forstreamprocessing.Italsoprovidessolutionstocommon problemsinthewaydataisprocessedand/orgraphicallydisplayed.

The rest of this article introduces THoarder infrastructure in detail.Itstartsbyanintroductiontothenecessaryaspectsthatdiscuss thequalityoftheinformationavailableinTwitter(Section2),which establish the computational limitations related to access tweets. It continueswithananalysisofthemostrelatedworkinitiatives(Section 3),whicharecomparedwithTHoarder.Then,THoarderisproperly described (inSection4)tobe laterevaluatedinSection5with real traces taken from Twitter and processed with THoarder. Lastly, Section6drawsconclusionandalsodescribesourongoingwork. 2. Tweets as source of information

TwitterwasapioneerinopeningitsdatabymeansofAPIsfrom 2006. This resulted in the emergence of multiple applications that accessandprocessTwitterdataandhasmadeiteasierforresearchers to obtain datasets for their experiments. In 2009, Twitter allowed access to the stream of tweets by means of the Streaming API, a restrictedversionofamorecompleteapproachcalled“Firehose”. 2.1. AccessingTwitterAPIs

The versionsoftheir APIshavechanged theformof access,the amountofinformationbytimeunitoﬀeredtotheuser(seeTable1). Currently,theaccesstotweetsisallowedusingthefollowingmechan isms:

(4)

Thesethreeoptionsaremutuallyexcluding;itisnotpossibletolook for more than one option simultaneously. When it is necessary to monitorusersandkeywords,thedeveloperhastoresorttosolutions that processinparallel,fusing,andeliminatingduplicatedmessages later. Forinstance,thiscase isrelativelyfrequentinthetrackingof electoralcampaignsinwhichitisinterestingtoknowtheconversation around the candidates (users) and the mottos for the campaign (representedasasetofkeywords).

3. State-of-the-art

Sinceitsinceptionin2006,theamountofdataavailableonTwitter hasgrownexponentially,producing500milliontweetsdaily.Thereis also anincreasingnumberof toolsavailabletoprocessdatacoming fromTwitterinordertoobtainvaluableinformation.

Ourcomparisonstartsbyanalyzingdifferentapproachesfordata collection.ThemainapproachesthatcanbecomparedwithTHoarder includeTwapperKeeper(OBrien,2011)TwitterTap(Kranjc,2014),and TwitterStrean toMongoDB (Del Fresno, 2012). The maindifference amongthesebasiccrawlersandTHoarderisthatTHoarderisneutral, referredtothetypeofstoragerequired(seeTable2).Mostapproaches storetweetsinlargedatabasesthatcanbeprocessedlater.Incontrast, THoarderstoresthisinformation ina plainfilesystemthat canbe accessedlaterforofflineanalysis.

In the stateoftheart (Goonetilleke et al., 2014), there are a numberofplatformsthatcomparewithTHoarder:

•

TwitterEcho. (Bošnjaketal.,2012).TwitterEchoisaTwitter crawlerwithafaulttolerantarchitecture,withsupportforreliable datastorage,andamodularcrawlinginfrastructure.

•

Twitter Zombie(Blacketal.,2012).TwitterZombieisameth odologicalapproachandtechnicaltoolforTwitterdatacollection.It hasbeenusedintheUSRepublicanPartyPresidential debatefor SouthCarolinaelectionstoanalyzetheirdata.

•

Twitter Analytics(Stavrakas and Plachouras,2012). Itpro videstheapplicationwitha3Layerinfrastructure:acrawling,an integrationmodelingandadataanalysislayers.Itsarchitectureis intendedforcampaignanalyticsandopinionmining.

•

Trendminer(PreotiucPietroetal.,2012).Thelastapproachis trendminer:anopensourceframeworkforeﬃcienttextprocessing. Itisbasedonthemapreduceprocessingmodelandhasacomplex processingengine.ProposedTHoardershareswithTwitterMiner theideaofhavinganinfrastructurewhichisabletoprocessahuge numberofdatainaproperway.

ComparingTHoarderwiththeotherrelatedframeworksforTwitter (seeTable3),itmayconcludedthatitsmainnoveltyisitsflexibleand lightweightinternalarchitecture,whichallowsforefficientmanagement ofbigandlonglastingTwitterexperiments.Thisstoragearchitecture allowsuserstoexploredataoffline,beforeitisgoingtobeprocessedto produceananalyticresult.Noneofthepreviousframeworkshavethis type of support, because they were designed as simple collection infrastructuresorasspecificanalytictools.Previousefforts

aremorefocused ontherequirements ofspeciﬁcanalytics,while T Hoarderismoreneutralonthetypeofanalysiscarriedout. 4. T-Hoarder in a nutshell

This section describes the architecture of THoarder, departing fromthehighlevelarchitectonicperspective tothediﬀerentaspects relatedtoitsinterfaces.

4.1. Overview

THoarderwasconceivedtostoretweetsaboutcertainhottopicsin whichthediﬀerenttypesofpropagationpatternsmaybeanalyzedin largedatasets.Atthemoment,THoarderisabletoshowtheevolution intimeofasetofsubjectsofsocialinterestlikecitizenmobilizations, political opinionson crisis, scandals, and corruption(Congosto and Aragón,2012;Peñalópezetal.,2014;Congosto,2014,2015).TheT Hoarderplatformstorestweetsbythematiclinesanditautomatically processestheminthreeaxis:time,space,andrelevance.Thetimeaxis allowsconsideringthetemporalevolutionofasetofindicatorslikethe most mentioned tweets and more active proportion of retweeted messages, popular users, popular hashtags, and the most frequent words.Thespaceaxislocatestweetsgeographicallyandtherelevance shows the most widespread messages. Also, THoarder has an interactive graphical interface that easies navigation in these three axes.Theplatformwasdesigned withaminimalistarchitecturethat avoidsthedependencyonothersoftwarepackages.

THoarder optedfor UNIX as a reference operatingsystem and Pythonasamainprogramminglanguagefortheenvironment.Theuse ofdatabaseswasdiscardedandTHoarderdecidedtorelydirectlyon theUNIXﬁlesystembecause:

1.It can run with minimum execution environments such as a RaspberryPI.7

2.Itisabletofacilitatethe transferof thedatafromone server to another.Whenthedatabasesarelarge,thebackup/restoreactivities areproblematic.

3.Ithasnotdeﬁned(apriori)anydatamodeling,notknowingthetype ofdiﬀerentinformationthatisgoingtobeprocessed.

4.Itenableseasyintegrationofdiﬀerentdatasets.

5.Itdoesnotrequireaccesstorandominformationsincetheproces singofthedataissequentiallydeﬁned.

A three layer disconnected architecture was deﬁned to avoid that the execution of a layer interferes in other elements of the architecture. The communication between these layers is always done usingﬁles.

Table 2

Comparison among the main methods to access Tweets.

Data Dependency Open

Collection from Twitter

on DB software Source

TwapperKeeper Yes Yes Yes

Twitter-Tap Yes Yes Yes

twitterstream-to-mongodb

Yes Yes Yes

T-Hoarder(our approach)

Yes No Yes

Table 3

Comparison among the different management frameworks.

Type of Integrated Complexity

analytics visualization

Twitter-Echo (Bošnjak et al., 2012)

Tweets collection No Med Twitter- Zombie (Black

et al., 2012)

Tweets collection Yes Med Twitter Analytics

(Stavrakas and Plachouras, 2012)

Opinion mining No Med

Trendminer (Preotiuc-Pietro et al., 2012)

Tweet Analysis No High

T-Hoarder Tweets collection, location and spread

Yes Low

(our approach)

(5)

The functional division in T Hoarder is the following: 1. Layer 1: Data collection and storage

2. Layer 2: Data processing 3. Layer 3: Visualization

Fig.1showsthefunctionallayersandhowtheyexchangeinforma tion. Layer 1 collects data from the Twitter API and stores it in compressedpacketswhenthesizereaches100MB.Layer2processes thecompressedpacketsseparately.Itisonlynecessarytoprocessthem once. Additionally it processes the latest information not yet com pressedbyhavingasizeoflessthan100MBandintegratestheresults of alltheinformationprocessed.Layer3displaystheresults onthe web.

ThesourcecodeofTHoarderispublicallyavailableinaGitHub repository(Congosto,2015).

4.2. Layer1:datacollectionandstorage

Following the mechanism imposed by Twitter APIs, THoarder registers an application and a set of users of such application on Twitter.Inordertobe abletoaccesstheTwitterAPIsbymeansof OAuth,itisnecessarytocreatetheaccesskeysfortheapplicationand the users.Thesekeys aregenerated inthe componenttweetauth, available onCongosto (2015),andstored inﬁles sothat thecompo nentsofTHoardercanusethemtoaccesstotheAPIs.

THoarderusestheStreamingAPIinsteadoftheRESTAPIforthe followingreasons:

1. It is the method most adapted to obtain data in real time with the only limitation of the maximum rate of 50 tweets per second. The alternative would be the use of the method GET/search/tweets

of the REST API that force periodic requests to obtain the messages with the consequent diﬃculty in determining the sampling fre quency.

2. T Hoarder is designed to cope with long lasting Twitter experiments. In another type of short duration experiments (days) and in which it is necessary to go back to a previous date, such as the Trending Topic (TT) it is not necessary the use of T Hoarder. It is more reasonable

to obtain tweets with method GET/search/tweetsof the REST API with which they will be possible to be recovered up to seven previous days and to obtain a complete dataset.

Intheoptions of theStreaming API, THoarder preferablyuses keywords and users.Geolocation is information that lacks thematic contextandrepresentsasmallsampleoftweets(e.g.1.5%inSpain).

Tweetsarestoredwithinstancesofthetweetstreamingcompo nent (available on Congosto (2015)),executed in parallel. The API providestweetsinJSONformat(Crockford,2016).Then,thedataare transformedto ﬂat text inwhichthe information for each tweet is storedinalinewithtabulatorseparatedﬁelds.Thisallowstoreadthe messagesthataretakenwiththisfacilityandtointroducethemina spreadsheet.

Ofallthereceivedinformation,theselecteddatathatisstoredand usedintheanalysislayerare:

1.id tweet: Identiﬁer of a tweet. It is an increasing number that is sequentially assigned to each message.

2.timestamp: date and GMT hour of a tweet. 3.@author: the screen name of the author of tweet. 4.text: text of a tweet.

5.app: Application from which a tweet has been published. 6.id author: Author identiﬁer. It is an increasing number assigned

by Twitter to each user when they log in the system.

7.followers author: Number of followers at the moment of the publication.

8.following author: Number of users followed at the moment of a publication.

9.statuses author: Number of tweets previously published. 10.location: location declared in the proﬁle of the user.

11.url: if a tweet contains a URL, it is this data. On the contrary it stores anullvalue.

12.geolocation: the coordinates that identify the location of the user if the tweet is geo located.

13.names: Name provided by the user 14.bio: description of the user.

15.url media: It contains a url whether the tweet has multimedia information; on the contrary stores to anullvalue.

(6)

16.type media: type of information multimedia (e.g. photo or video). In case of a non existing media its value isnull.

17.lang: language of the tweet.

Data storage

The platform has been organized in a structure of predefined folders. Folder andfile names use predefined prefixes and suffixes to easy the location of the stored information. Fig. 2shows how code, data, access keys, and the web zone are organized. Code is located in thescriptsdirectory, data is placed in thestoredirectory and in each experiment's folder, the access keys for the Oauth protocol are stored in thekeysdirectory andfinally the web prepared data is located in the templatesandwebdirectories.

Streams of each experiment are stored in the store directory in packages of ﬁles. Theﬁles are generated with the following pattern:

streaming experiment x.txt name, (x: 0, n).Thefirstfile is numbered with zero and when it reaches the size of 100 MB is compressed and a new file with an increasing numbering is created. For textfiles, the compression is very efficient and the size of the data typically reduces to the third part.

The context of each experiment is stored in a folder under thestore directory. The context includes either aﬁle with the list of keywords, users or locations to be monitored as processed data.

4.3. Level 2: data processing

The data processing is made of independent form to the capture and storage to avoid premature tweet losses. Therefore, the elaboration of the data is made of periodic form by means of scheduled activities (running on UNIX'scron).

Large experiments could generate datasets that would turn out very expensive to process. Nevertheless, the method of storage of streams in packages of ﬁles delimited by size makes viable to process them by parts, and to integrate results later. This approach has the following advantages:

1. It is feasible to process compressed data directly without having to uncompress it.

2. It is possible to process the different packages fromfiles in parallel. 3. It is not necessary to process again a package already processed; only the new data from the last iteration has to be processed at each step. In theresourcesfolder, different resources are stored to process the data, for example: tables of names, geo located localities, and diction aries to classify tweets.

In thewebdirectory,there is a directory for each experiment in which the processed data will be stored to be presented/displayed in the graphical web interface.

Processing a package

For each package, the following operations are made: 1. Toﬁlter false positives

2. To extract indicators 3. To extract relevance 4. To extract location 5. To generate the package 4.3.1. Falsepositives ﬁltering

Occasionally,falsepositiveshappenbecausethetermsusedinthe queriescontainambiguouswordsorthewordsappearinatweetina diﬀerentorderthanexpected.Forinstance,aquery“metroMadrid”

(Madridunderground)canreturnfalsepositivesbecauseMetroisaTV channelthatbroadcastsRealMadridandAtléticodeMadridfootball matches.Falsepositivesaredetectedusingﬁlterpatternsstoredina

filecalledfilter.txt.Thisfilecontainsasetofwordsorexpres sions that do not correspond tothe expected context. The filter is carriedoutinascriptcalledtweetsselectfilter.py,availableon

Congosto(2015).

4.3.2. Extractingindicator

Toobservetheevolutionofanexperimentaseriesofindicatorsare computed.Suchindicators areexposed laterinthetimeaxis.These indicatorsprovidecluesontheparticipationandpublicationwayofthe messages:

1.Number of tweets: amount of tweets gathered.

2.Number of RTs: amount of tweets spread by means of the broad casting mechanisms.

3.Number of replies: amount of tweets answering to another tweet. 4.Number of mentions: amount of tweets that contain mentions. 5.Number of unique users: amount of diﬀerent users have tweeted. 6.Number of new users: amount of users who tweeted for theﬁrst

time.

7.Top hashtags: for each one of the most mentioned hashtags, the amount of times that appears in the tweets.

8.Top words: for each one of the most frequent words (they do not consider ending words), the amount of times they appear in the tweets

9.Top mentioned users: for each user, the amount of times that appears in the tweets.

10.Top active user: for each one of the most active users, the amount of tweets that the user has published.

These indicators are extracted with the tweets counter.py script,availableonCongosto(2015).

4.3.3. Dealingwithrelevance

In THoarder, relevance is measured by the diﬀusion of the messages.The messagesspread becausetheycatch theattention of otheruserswhochoosetogivevisibility.OnTwitter,thepropagationof messages is made by means of the retweeting (RT). The RT is a conventioncreatedinthebeginningsofTwitterbyuserswhowantto sharetweetswith theirfollowers. Since2009 Twitterincludeda RT button that did the same but automatically, which facilitated the propagationof messagesalot. Peoplegenerally spreadtweetswhen theyagree.Insomesense,itcouldbeinterpretedasapositivevotefor themessage.

THoarder discarded using the data of the number of RTs that providestheStreamingAPIforbeingdynamiccharacteristicsthatvary withtime.THoarderdetectstweetsdiﬀusioncomparingthesimilarity

Fig. 2.T-hoarder predeﬁnedﬁlesystem structure for code, data, access keys, resources

(7)

1. Identiﬁer of tweets 2. Date and hour of a tweet 3. Author of a tweet 4. Text of a tweet 5. RT count

This is how T Hoarder obtains the relevance.

for each tweet get hour, day ifﬁrst tweet then

last hour hour, last day day if tweet in RT global: then increase RT global[tweet].count else RT global[tweet].count 1 if tweet in RT day: then: increase RT day[tweet].count else: RT day[tweet].count 1

if hour > last hour or len(RT day) > 15000 then

RT global 2000 tweets with more RTs RT day 2000 tweets with more RTs last hour hour

if day > last day then

Order descending RT day by count and writeﬁle empty RT day

last day day

Order descending RT global by count and writeﬁle

InTHoarder,themostwidespreadtweetsareobtained withthe

tweets talk.pyscript,availableonCongosto(2015). 4.3.4. Extractinglocation

Thelocationoftweetscanbeobtainedintotwoways.First,itisthe declared location in the profile of the user. This data could be incompleteoritmaycontainthenameofafictitiouslocation.Forthis reason, it isnot possible tolocate allthe messages geographically. However,itispossibletolocateahighpercentageoftweets(between 70%and60%).Thesecondoptionistoprovidegeolocalizedtweets.In thiscasethepercentageismuchsmaller(inSpain,1.5%).T Hoarder hasaresourceinwhichallthemunicipalitiesofSpainareclassifiedby state,provincesandgeographicalcoordinates(longitudeandlatitude). Withthesedata,tweetscanbegeolocatedonamapaccordingtoits declaredlocationandalsoitispossibletoaggregatethembyprovince or state. Supportfor locationsfromother countries usingthesame approach. For the geolocalization, simply the coordinates of each tweetareextracted.

Everyday,tweetslocatedbyuserproﬁleandgeolocatedtweetsare stored.Inbothcasestherecordsstoredare:

1. Tweet identiﬁer

2. Date and hour of the tweet 3. Author of the tweet

4. Text of the tweet 5. Tweet coordinates

The locations are obtained with the tweets location.py script,availableonCongosto(2015).

4.3.5. Generatingthestateofapackage

Whenapackageisbeingprocessed,itstoresinaﬁledenominated

experiment x status.txt,itsstateinformation,containing: 1.Initial date: date of the older tweet

2.Last date: date of the most recent tweet

3.State: state of the process of the package. It takes the values: semi processed or processed.

4.Last processed tweet: identiﬁer of last processed tweet. Length of the package: packet size when it was processed

5.Number of tweets: amount of tweets inside the package 6.Running time: total time running in the system 4.3.6. Results integration

Results are calculated on a per day basis, reason why integration is something as simple as the concatenation of results. Also, it is necessary to consider a possible edge eﬀect: as the partition of dataset in packages is made by size (100 K) it can be left a day divided in two diﬀerent packages.

In the case of tops (words, hashtags, mentioned, and active users), although the top 1000 items are stored, only the top 10 are displayed. The top 10 results are stored in the interchangewebdirectory,so that the web server can access them.

This is the process for join results.

For each data packet

Store in memory entity counters by day (one day can be in two different packages)

Store in memory top 1000 of entities (one day can be in two different packages)

Store in memory global RTs

Store in memory RTs per day (one day can be in two different packages)

Store in memory locations

Generateﬁle of entity counters by day

Generateﬁle of top entities, reducing the top 1000 to 10

Generatefile of global RTs Generatefile of RTs per day Generatefile Locations

The dataintegration isdone with the join results.pyscript, availableonCongosto(2015).

4.4. Level3:visualization

Toanalyzetheevolutionoftherecoveredinformation,webpanels weredesignedtovisualizethediﬀerenttypesofformats.Thetechnol ogiesusedforthesepanelsare:

1. Framework bootstrap of HTML, CSS stylesheets and JavaScript. They allow giving structure, style, and interactivity with the diﬀerent elements of a panel.

2. The Digraphs Library is used for the temporary graphs. It has options to favor interactivity, including facilities to make zoom lens and to invoke functions from a point of the graph to contextualize information.

3. Google Maps for the visualization of maps. of messages and considering thestructure of the RT. Therefore, it

detectsbothmanualRTsandautomaticRTs.Thediﬀusionofmessages iscalculatedperdayandforalltheperiodoftweetscapture.Inthisway itisknownthemostrelevantineachdayandthemostoutstandingof thecompletedataset.Thedatastoredforthemostspreadtweetsare:

(8)

TheTHoarderpanelsarebuiltasapuzzle,usingthefeature < iframe> thatembedsdiﬀerentsmallpiecesofHTMLgeneratedfroma set oftemplatescustomized. Thepaneliscreatedautomaticallywith theshellscript:make panel.py,availableonCongosto(2015).

4.4.1. Templateofthemainpage

Itisagenerictemplatewiththestructureofawebpanelinwhichit isdistinguishedthedescriptionoftheexperimentandtheaccesstothe diﬀerenttimegraphsormaps.Thistemplateiscustomized foreach experiment by replacing the token @experimento with its speciﬁc name.

Fig.3showshowinformationispresentedontheweb.Fourzones havebeenhighlighted:navigationbar,experimentdescription,menu optionsandinteractivegraphs.

1. A navigation bar from which it is possible to access to the main menu of T Hoarder. This bar is common to all panels.

2. Description of the dataset and the entities that are being monitored. This is a frame with textual information.

3. Menu options to select diﬀerent views from the information: users, type of tweets, more frequent terms and hashtags, more mentioned and more active users, location of messages, and help.

4. Interactive graphical representation of the information selected in the menu. There are two types of graphs: the time graphs which show the evolution of the diﬀerent indicators, and the geographic ones plotted with a map.

4.4.2. Templatefortimegraphs

Thisgenerictemplatefortimegraphsisusedforseveraltypesof entities,forthisreasonitisinstantiatedforeachofthemreplacingthe tokens @experimento and @data ﬁle by their speciﬁc values. The schemeofthistemplateisavailableonCongosto(2015).

The time graphs contain two parts:

1.The interactive graphic area shows the evolution of values of a selectedentity.Movingthemousethroughthegraphthenumerical valuesof theelementsofthelegendareshown.Inordertomake zoomlens,leftbuttonofthemousehastobepressedanditcrawls. Theinterface alsoallowseliminatingthezoomlens withadouble click.AnexampleofinteractivegraphsisshowninFig.4

2.Mostrelevanttweetsforanentity.Default,displaythemostrecently

spread.Itispossibletoshowthemostpropagatedfortheexperi mentpressinginthebutton“spread”.Inordertodiscoverthemost populartweetsinaday,itisonlynecessarytoclickonthedateofthe graphontheleft.Inallcases,themessagesarepagedingroupsof fouranditispossibletoshowamaximumofforty.Fig.5showsan exampleoftweetsrelatedwithaninteractivegraph.

4.4.3. TimePanelTemplateforthelocationoftweets

There isa HTML template that allows visualizing a map (with GoogleMaps)inwhichthefrequencyoftweetsbyareasisdisplayedby meansofa heatmap.Thistemplateisinstantiated byreplacingthe token“@dataﬁle”bythenameoftheﬁlewiththedata.Fig.6 shows anexampleofaheatmapwithgeographicareaswithmoredensityof tweets.ThistemplateisavailableonCongosto(2015).

Fig. 3.Diﬀerent areas on the t-hoarder web panel: navigation bar, experiment description, menu options and interactive graphs.

(9)

thistypeofinfrastructure.Italsoshowsthebenefits stemmedfrom dividing the data into different blocks, called packages in the T Hoarderjargon.Tothisend,theevaluationhasbeencarriedoutona singlemachine(Table4)whichcomputesthedifferentperformanceof differentstreamsof tweets.Thistype ofinfrastructureisenough to determine bands of performance for THoarder when using it to processmediumsizedatasets.

5.1. Benchmarkcharacterization

Thefourdatasetanalyzed(seeTable5)correspondstoSpanishand international microblogging communities. The ﬁrst, labeled

Diputados, contains the impact of tweets of the members of the Spanishparliament;itisalargedataset(3years)withasmalltweeting frequency(8ktweetsperday).Theseconddatasetreferstotweetswith anURLthatbelongstothemainnewspaperinSpain:El Pais;itcan be classiﬁed as a medium duration time (>1 year) and average frequency among tweets (27k tweets perday). The third dataset is international,labeledEcology,referstotweetsthatcontainingasetof wordsrelatedwith sustainableenvironment.Ithaslongduration(3 years)withaveragedatarate.Lastly,theforthdatasetreferstoEbola

refereeingtoahighertweetrate(0,5milliontweetsdaily)inashort time(80days).

The datasets may also be considered in terms of the packages necessary toprocesseach datasetand thestoragespace required to storethem(seeTable6).Intermsofnumberofpackages,Diputados

requires26 packages,El Paistakes46,Ecology100, andEbola

169.Thenumberofpacketshasadirectconnectionwiththenumberof GB required to store the dataset information, which varies from 2.77GbforDiputadostothe16GbrequiredforEbola.

Each diﬀerent dataset takes a diﬀerent amount of time to be processedbytheTHoarderengine.Diputadosisthesmallerdataset andtakes4hand13min.El Paistakes4.5hofcomputationtime. Ontheotherhand,Ecology andEbolaarelargedatasetandtake morethanhalfadaytocomputethedataset.

5.2. Packageprocessingtimes

Oneofthemainsourcesofcomputationaloverheadisinthetime requiredfor processingeach package.Fortheproposeddataset, the amount of timedepends onthe typeof processing carried out and mainlyontheamountoftweetsthathavetobeprocessed.

Fig.5.Exampleoftweetsrelatedwithaninteractivegraph. 4.4.4. Templateforgeolocating

ItisanHTMLtemplatethatenablesvisualizinginamapofGoogle Mapsthegeolocatedtweets.Thistemplateiscustomizedbyreplacing the “@data ﬁle” tokenwith thename of thedata ﬁle. Placingthe mouse over certain location it is possible to read the messages associatedtosuch location.Fig.7showsan exampleof a mapwith thegeolocatedtweetsofanexperiment.Thistemplateisavailableon

Congosto(2015).

5. Empirical evaluation

ThegoaloftheexperimentalstudiescarriedoutwithTHoarderis toprovideempiricalevidenceontheperformanceonemayexpectfrom

(10)

data, the most expensive calculation is the counters and the less expensive is the location information:

•

FortheDiputadosdataset(Fig.8),theresultsshowthediﬀerent costsassociatedtothescriptsinchargeofprocessingtweets:tweets counter.py,tweetstalk.py,andtweetslocation.py.In thecaseoftweetscounter.py,thetimerangesfrom4minand22 5minand40forlargerpackages.Thetweetslocation.pyscript movesform2minand203min,forthesamedataset.Lastly,the

tweetstalk.pyscriptisinthe1minand25srangetothe2min and20sforlargerdatasets.

•

FortheEl Paisdataset,theresultsareshowninFig.9Theresults showthatthetweets counter.pyscriptrequiresfrom2minand 40s to3:50, havingamorereduced amountoftweetbundledin each package than the Diputados dataset. For the tweets talk.pyscript,theoutcomemovesfromthe50stothe1min and10s.Lastly,thecodeinchargeoflocationtakesfrom1minand 26 2min.

•

Thethirddataset,Ecology,thedatashowsthesameperformance patterns(seeFig.10).Thetweets counter.pyscripttakesfrom 3minto4minand40s.Forthesamescenario,thescriptincharge ofcalculatinglocationsmovesfrom2minto3minand10s.Lastly, thelessheavyscriptisthetweets talk.pyscript,whichmoves

Fig. 7.Geolocated tweets on T-Hoarder.

Table 4

Characterizing the evaluation infrastructure.

Infrastructure CPU 2 Ghz (8 cores) Memory 8 Gigabytes OS Ubuntu T-Hoarder 0.0.9 Benchmarked application

Templates T-hoarder end-to-end tool-chain:

1.tweets_counter.py

2.tweets_talk.py

3.tweets_location.py

Number of tweets 10 44 million of tweets

Number of packets 26 169 packets of

100 Mbytes

Sincethesizeofeachpackageisﬁxedto100Mb,thenumberof tweetsof eachblockdependsonthestoredformatofthetweets.In olderdatasetslessinformationwasstoredandasaresult,itisincluded further data at the end of each registry in order that they were backwardcompatiblescripts.

For this reason in datasets like Diputados and Ecology the number of tweets by package moves between 300,000 to 400,000 whereasindatasetslikeEl PaisandEbola,therangeoftweetsper packagemovesfrom210,000to300,000.Thevariationofthenumber of tweetsindatasetswiththesameformathasdependencywiththe numberofURLs,geolocationinformation,andsizeoftheuser'sbios. Therefore,forthegivendataset,theaveragetimeforprocessinga tweetoscillatesin1.3 1.5msrange.AsitisshowninFigs.8 11,the runtimeseemstokeepsomelinearitywiththenumberoftweets.Inall

Table5

TypesofdatasetsevaluatedinT-Hoarder.

Data set (LANG) Total number of tweets Total Number of days From to Average Tweets/day

Diputados (ES) 10,100,380 1235 11-12-29 15-05-17 8177.68

El Pais (ES) 11,703,899 433 14-05-07 15-07-14 27,052.19

Ecology (EN) 37,219,566 1130 12-06-09 15-07-14 32,940.71

Ebola (EN) 44,550,169 80 14-08-04 14-10-24 555,241.75

Table 6

Performance for each dataset in T-Hoarder.

Data set Packets Storage space (Gb) Processing time

(hh:mm:ss)

Diputatos 26 2.77 4:13:54

El Pais 49 5.15 4:33:02

Ecology 100 10.28 13:57:14

(11)

from 1 min to almost 2 min.

•

Lastly,thefourthdatasetreferstoEboladatasets,showninFig.11. Again,thesamepatternsshowninthepreviousdatasetarevalidfor theexperiment.Thetweets counter.pyscripttakesfrom3min to5minperpacket.Thelocationscriptmovesfrom2minto3min and30s.Lastly,thetweets talk.pyscriptmovesfrom1minto almost2min.

From these empirical results we can infer:

1.Runtimetoprocessapacketdoesnotdependonthesizeofthe datasetorthenumberofpacketsinwhichitisdivided.Thefourcases studiedhavediﬀerentsizesandtheexecutiontimeperpacketare similar

2.Theexecutiontimeofapackagedependsonthenumberoftweetsit contains.InFigs.9and11withfewertweetsperpackettheruntimes

3. The processing system is scalable because, regardless of the size of the dataset, the tweets are processed in small packages in which the runtimes are known: less than six minutes fortweets counter, less than three minutes fortweets locationand less than 2 min fortweets talk.

5.3. Joiningperformance

Thedivisionofdatacomeswithextraoverheadssincethedatahas to be merged after being partially processed in each package. The integrationtimeforpackagesdependsonthedurationofthedataset. Thisisbecausetheinformationisprocessedperdaysandthereforeit increasesthenumberofiterationsofthealgorithm.

For the datasets described in previous sections, it has been evaluatedthisoverhead.Theresults(inFig.12)showhowthesetimes

Fig. 8.Processing time per each package inDiputadosdataset in hh:mm:ss format.

Fig. 9.Processing time per each package inEl_Paisdataset in hh:mm:ss format.

Fig. 10.Processing time per each package inEcologydataset in hh:mm:ss format.

Fig.11.Processingtimepereachpackagein Ebola datasetinhh:mm:ssformat. arelessthanfourminuteswhereasFigs.8and10withmoretweets perpacketexceedthistime.

(12)

Fig.12.Processingforjoiningdatainhh:mm:ssformat.

areshorter, incomparison withthetotaltimerequired toprocessa packageofdata.Inthosecases,wheretheintervalofthedataislarger, likethecaseofDiputadosandEcologydatasets,thejoiningprocess needsmoretimeinspiteofhavingfewerpackages.However,thistime is much more reduced than the total time required toprocess the package.Speciﬁcally, theoverheadproducedbythejoinofpackages with respecttototalruntimeofpackagesisa0.33% inDiputados

(long period), a 0.16%inEl Pais(mediumduration),a 0.19%in Ecology(longperiod)and0.09%inEbola(lowduration).

5.4. Beneﬁtsstemmedfromsplittingpackages

Our last experiment analyzesthe beneﬁts in splitting tweets in severalpackages,comparingtheperformanceofthisstrategyagainst anotherequivalentstrategywhichconsistsinhavingasinglepackage. Amonolithicdatasethasbeenevaluated.Itconsistsof10packages whichcontainalarge amountof tweets.Thisdatasetismuchmore reducedthanthenumberofpackagesintroducedinprevioussections; our previous experiments refer to 26 169 packets. It has been calculatedtheexecutiontimefordiﬀerentpacketsseparately,andthe global dataset.Onthesedata, ithas beenlooked forthecorrelation betweenthenumberoftweetsprocessedandthethreecontributorsto theoverhead:tweets counter.py,tweets talk.py,andtweet s location.py.

TheresultsoftheevaluationhavegraphicallydisplayedinFig.13. Amongthethreescripts,tweets talk.pyisthemostaﬀectedbythe sizeofthepacketandtweets counter.pythescriptthathasless inﬂuencefromtheamountofdataprocessed.Thethreeruntimesfor scriptsmovesfrom0:05:31to0:00:59.Theaveragetimeofexecution of apackageis0:07:28.Becauseonceprocessedapackageitisnot necessarytoprocessitagain,itcanmeasuretheimprovementinthe executiontimeofthemodelbasedonpackagestoworkwithacomplete dataset.

Withalargepackagethecomputationtimeincreaseswiththesize of thepackage.Inour experiments,theprocessingof thedatasetis processedin1:04:43,morethan8xtimestheexecutingoftheindividual packages(seeFig.14toestablishtheempiricalevidence).Thistypeof resultseemstojustifythebeneﬁtsbyexploitingthedataintodiﬀerent

ﬁles inT Hoarder.This numbersaremuchmorerelevant whenthe number of tweets that have to processed increases, but it is also signiﬁcantwith10packages.

6. Conclusions and future work

Theanalysisofmicroblogsrequiresspeciﬁctoolsthathelpperform global analytics applied to this popular environment. This work contributeswithourparticularexperienceindesigningandevaluating atooltoperformanalyticsonmicroblogs.WiththeaimonTwitter,the

Fig. 13.Correlation among the number of tweets and time taken to processed them.

Time measured in hh:mm:ss.

Fig.14.Correlationamongthenumberoftweetsandtimetakentoprocessthem,witha singlepacketstrategy.Timemeasuredinhh:mm:ss.

articlehasdescribedacosteﬀectiveframeworkcalledTHoarder.The mainadvantageprovidedbythisframeworkisthepossibilityofusing anintegratedapproachtostore,process,andtodisplaypreprocessed data asan output directlyona navigator. The empiricalevaluation carriedoutwithseveraldatasets,revealedtheperformancedelivered bytheproposedapproachwhichisabletoprocessmillionsoftweetsin seconds.

Our future work in relationship with T Hoarder includes two mainresearchlines.First,theauthors are analyzing the advantages stemmed from the use of diﬀerent techniques of parallel and distributedcomputingas architectonicblocksuseful toreducethe totalcomputationtimeofourcurrentengine.Thesecondlinerefers tothedevelopmentofanalternativeapproachbyusingcommonoﬀ

the shelf big data engines (based on Storm (Apache Storm, 2014; MarzandWarren,2015;BasantaValetal.,2015;BasantaValetal., 2016),eﬃcientmapreducestrategies(Anjosetal.,2015;Leeetal., 2013),and Hadoop(Zikopoulosand Eaton, 2011) to run stream analytics.

(13)

〈https://storm.incubator.apache.org〉.

Asur,S.,Huberman,B.A.,2010PredictingtheFutureWithSocialMedia,Computing. Barberá,P.,Rivero,G.,2012.DesigualdadenladiscusiónpolíticaenTwitter.Congr.

ALICE.

Basanta-Val,P.,Fernández-García,N.,Wellings,A.J.,Audsley,N.C.,2015.Improving thepredictabilityofdistributedstreamprocessors,futuregenerationcomputer systems.SciencieDiret52,22 36.

Basanta-Val,P.,Audsley,N.C.,Wellings,A.,Gray,I.,Fernandez-Garcia,N.,2016. Architectingtime-criticalbig-datasystems.In:IEEETransactionsonBigData,vol. PP,no.99,pp.1 1.〈http://dx.doi.org/10.1109/TBDATA.2016.2622719〉. Black, A., Mascaro, C., Gallagher, M., Goggins, S., 2012. Twitter zombie : architecture for

capturing , socially transforming and analyzing the Twittersphere. In: Proceedings of the 17th ACM Int. Conf. Support. Gr. Work, pp. 229 238.

Bollen, J., Mao, H., Zeng, X., 2011 Twitter mood predicts the stock market. Computer (Long. Beach. Calif). pp. 1 8.

Bollen, J., Gonçalves, B., Ruan, G., Mao, H., 2011. Happiness Is Assortative In Online Social Networks. pp. 1 17.

Bošnjak, M., Oliveira, E., Martins, J., Mendes-Rodrigues, E., Sarmento, L., 2012. TwitterEcho a distributed focused crawler to support open research with Twitter data. In: Proceedings of the WWW 2012, 21st Int. Conf. Companion World Wide Web, pp. 1233 1239.

Cha,M.,Haddadi,H.,Benevenuto,F.,Gummadi,P.K.,2010.Measuringuserinﬂuencein twitter:Themillionfollowerfallacy.Icwsm,10(10-17),30

Congosto,M.,2015.T-Hoardersource-code.Available: 〈https://github.com/congosto/t-hoarder〉.

Congosto,M.,2015.EleccionesEuropeas2014:ViralidaddelosmensajesenTwitter. Rev.Redes26,2352.

Congosto,M.,Aragón,P.,2012.AnálisisDeLasElecciones20N,ALICE2012. Congosto,M.,2014.Twittercomofuenteparaconocerlaopiniónpública.In:Lasnuevas

tecnologíasaudiovisualesfrentealosprocesostradicionalesdecomunicación,C.A. deC./64CAC,Ed.pp.117 142.

Conover,M.D.,Ratkiewicz,J.,Francisco,M.,Gonçalves,B.,Flammini,A.,Menczer,F., 2010.PoliticalpolarizationonTwitter.Networks,8996.

Crockford,D.,2006.Theapplication/jsonmediatypeforjavascriptobjectnotation. JSON.

DeDomenico,M.,Lima,a.,Mougel,P.,Musolesi,M.,2013.Theanatomyofascientiﬁc rumor.Sci.Rep.3,2980.

DelFresno,G.,2012.Twitterstream-to-mongodb.Available:〈https://github.com/ gdelfresno/twitterstream-to-mongodb〉.

Dodds,P.S.,Harris,K.D.,Kloumann,I.M.,Bliss,C.A.,Danforth,C.M.,2011.Temporal patternsofhappinessandinformationinaglobalsocialnetwork:Hedonometrics andTwitter.PloSone,6(12),e26752

Gayo-Avello,D.,2011.Don’tturnsocialmediaintoanother‘LiteraryDigest’.Commun. ACM54(10),121.

Goonetilleke,O.,Sellis,T.,Zhang,X.,Sathe,S.,Goonetilleke,O.,Sellis,T.,Zhang,X., Sathe, S., 2014. Twitter analytics: a big data management perspective twitter analytics : data management perspective. ACM SIGKDD Explor. Newsl., vol. 16, no. 1, pp. 11 20.

Gudivada,V.N.,Baeza-Yates,R.A.,Raghavan,V.V.,2015.BigData:Promisesand Problems.IEEEComputer,48(3),pp.20-23

Honey,C.,Herring,S.C.,2009.Beyondmicroblogging:conversationandcollaboration viaTwitter.In:Proceedingsofthe42ndHawaiiInternationalConference.System Sciences2009.HICSS’09.pp.1-10.IEEE,pp.1 10.

Huberman,B.A.,Romero,D.M.,Wu,F.,2008.Socialnetworksthatmatter:Twitter underthemicroscope.Computing,1 9.

Java,A.,Song,X.,Finin,T.,Tseng,B.,2007,August.Whywetwitter:understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis pp. 56-65. ACM

Kranjc,J.,2014.Twitter-Tap.Available:〈https://github.com/janezkranjc/twitter-tap〉. Lee,D.,Kim,J.S.,Maeng,S.,2014.Large-scaleincrementalprocessingwithMapReduce.

Futur.Gener.Comput.Syst.36,6679.

Leiba,B.,2012.Oauthwebauthorizationprotocol.IEEEInternetComput.1,74 77.

Manyika,J.,Chui,M.,Brown,B.,Bughin,J.,Dobbs,R.,Roxburgh,C.,Byers,A.H.,2011. Bigdata:Thenextfrontierforinnovation,competition,andproductivity Marz,N.,Warren,J.,2015.PrinciplesandBestPracticesofScalableRealtimeData

Systems.ManningPublicationsCo..

OBrien,J.,2011.TwapperKeeper.Available:〈https://github.com/540co/ yourTwapperKeeper〉.

Peña-lópez,I.,Congosto,M.,Aragón,P.,2014.JournalofSpanishculturalstudies SpanishIndignadosandtheevolutionofthe15MmovementonTwitter:towards networkedpara-institutions.J.SpanishCult.Stud.,37 41.

Preotiuc-Pietro,D.,Samangooei,S.,Cohn,T.,Gibbins,N.,Niranjan,M.,2012,June. Trendminer: An architecture for real time analysis of social media text. In Proceedings of the workshop on real-time analysis and mining of social streams Stavrakas,Y.,Plachouras,V.,2012.Aplatformforsupportingdataanalyticsontwitter:

challengesandobjectives.In:CEURWorkshopProceedings,vol.895,no.Ict 270239,pp.1 6.

Zikopoulos,P.,Eaton,C.,2011.UnderstandingBigData:AnalyticsForEnterpriseClass HadoopAndStreamingData.McGraw-HillOsborneMedia.

Acknowledgements

ThisworkbeenpartiallysupportedbyHERMESSMARTDRIVER (TIN201346801C42R)andAUDACity(TIN201677158C41R). References

Anjos,J.C.,Carrera,I.,Kolberg,W.,Tibola,A.L.,Arantes,L.B.,Geyer,C.R.,2015.MRA+ +:schedulinganddataplacementonMapReduceforheterogeneousenvironments. Futur.Gener.Comput.Syst.42,22 35.