Big Data Analytics for Wireless and Wired Network Design: A Survey

(1)

Contents lists available at ScienceDirect

Computer

Networks

journal homepage: www.elsevier.com/locate/comnet

Review

article

Big

data

analytics

for

wireless

and

wired

network

design:

A

survey

Mohammed S. Hadi

∗

, Ahmed Q. Lawey

, Taisir E.H. El-Gorashi

, Jaafar M.H. Elmirghani

SchoolofElectronicandElectricalEngineering,UniversityofLeeds,UnitedKingdom

a

r

t

i

c

l

e

i

n

f

o

Articlehistory:

Received16June2017 Revised13January2018 Accepted16January2018

Keywords:

Bigdataanalytics Networkdesign Self-optimization Self-conﬁguration Self-healingnetwork

a

b

s

t

r

a

c

t

Currently,theworldiswitnessingamountingavalancheofdataduetotheincreasingnumberofmobile networksubscribers,Internetwebsites,andonlineservices.Thistrendiscontinuingtodevelopinaquick and diversemannerintheformofbig data.Bigdataanalyticscan processlargeamounts ofrawdata and extract useful,smaller-sizedinformation,whichcan beused bydifferentparties tomakereliable decisions.

Inthispaper,weconductasurveyontherolethatbigdataanalyticscanplayinthedesignofdata communicationnetworks.Integratingthelatest advancesthatemploybig dataanalyticswiththe net-works’control/trafficlayersmightbethebestway tobuildrobustdatacommunicationnetworkswith refinedperformanceandintelligentfeatures.First,thesurveystartswiththeintroductionofthebigdata basicconcepts,framework,andcharacteristics.Second,weillustratethemainnetworkdesigncycle em-ployingbig dataanalytics.Thiscyclerepresentstheumbrellaconceptthatunifiesthesurveyedtopics. Third,thereisadetailedreviewofthe currentacademicandindustrialeffortstowardnetworkdesign usingbigdataanalytics.Forth,weidentifythechallengesconfrontingtheutilizationofbigdataanalytics innetworkdesign.Finally,wehighlightseveralfutureresearchdirections.Tothebestofourknowledge, thisisthefirstsurveythataddressestheuseofbigdataanalyticstechniquesforthedesignofabroad rangeofnetworks.

1. Introduction

Networks generate traﬃc in rapid, large, and diverse ways, which leads to an estimate of 2.5 exabytes created per day [1] . There are many contributors to the increasing size of the data. Forinstance,scientiﬁcexperimentscangeneratelotsofdata,such as CERN’s Large Hadron Collider (LHC) that generates over 40 petabyteeachyear [2] .Social media alsohasits share,withover 1billion users, spending an average 2.5h daily, liking, tweeting, posting,and sharing their interests on Facebook and Twitter [3] . It is without a doubt that using this activity-generated data can affectmanyaspects,suchasintelligence,e-commerce,biomedical, and data communication network design. However, harnessing thepowers ofthisdataisnot an easy task.Toaccommodate the dataexplosion,data centers arebeing builtwithmassive storage andprocessing capabilities, an example of which is the National SecurityAgency(NSA)Utahdatacentrethatcanstoreupto1 yot-tabyteofdata [4] ,andwithaprocessingpowerthat exceeds100

∗ _{Corresponding}_author.

E-mail addresses: [email protected] (M.S. Hadi), [email protected] (A.Q. Lawey),[email protected](T.E.H. El-Gorashi),[email protected]

(J.M.H.Elmirghani).

petaflops [5] .Duetotheincreasedneeds toscale-updatabasesto datavolumesthatexceededprocessingand/orstoragecapabilities, systemsthatran oncomputerclustersstartedtoemerge.Perhaps thefirstmilestonetookplaceinJune1986whenTeradata [6] used the first parallel database system (hardware and software), with oneterabytestoragecapacity,inKmartdatawarehousetohaveall their business datasaved andavailable forrelationalqueries and businessanalysis [7,8] .OtherexamplesincludetheGammasystem of the University of Wisconsin [9] and the GRACE systemof the UniversityofTokyo [10] .

Inlight oftheabove,theterm“Big Data” emerged,anditcan be defined as high-volume, high-velocity, and high-variety data that providessubstantial opportunitiesforcost-effective decision-makingandenhancedinsightthroughadvancedprocessingwhich extracts information andknowledge fromdata [11] .Anotherway to define big data is by saying it is the amount of data that is beyond traditional technology capabilities to store, manage, and processinanefficientandeasyway [12] .Bigdataisalreadybeing employed by digital-born companies like Google and Amazon to helpthesecompanieswithdata-drivendecisions [13] .Italsohelps in the development of smart cities and campuses [14] , as well as in other fields like agriculture, healthcare, finance [15] , and transportation [16] .Bigdatahasthefollowingcharacteristics:

https://doi.org/10.1016/j.comnet.2018.01.016

(2)

Table1

Variousbigdatadimensions. No.

of Vs

References Dimensions(Characteristics)

Volume Velocity Variety Veracity Value Variability Volatility Validity Complexity

3Vs [25–31] √ √ √

4Vs [4,32–34] √ √ √ √

[35–39] √ √ √ √

5Vs [3,11,21,40,41] √ √ √ √ √

6Vs [20,22,24,42] √ √ √ √ √ √ √

7Vs [23,43] √ √ √ √ √ √ √

1- Volume:Thisisarepresentationofthedatasize [17] .

2- Variety: Generatingdata froma varietyofsources results ina rangeofdatatypes.Thesedatatypescanbestructured(e.g. e-mails),semi-structured(e.g.logﬁlesdatafromawebpage);and unstructured(e.g.customerfeedback),andhybriddata [18] . 3- Velocity: Isan indicationofthe speedof thedatawhen being

generated, streamed, and aggregated [19] . It can also refer to the speed at which the data has to be analyzed to maintain relevance [17] .

Depending onthe researcharea andthe problemspace, other termsorVscanbeadded.Forexample,isthisdataofanyvalue? Howlong canwe considerthisanaccurate andvaliddata?Since we are conducting a survey, we ﬁnd it compelling to brieﬂy introduce other Vs aswell. Typically, thenumber ofanalyzed Vs is 3 to 7 in a single paper (e.g.6V+C [20] ), whereC represents Complexity, however,different papersanalyze differentsets of Vs and the union (sum) of all the analyzed Vs among all surveyed papersis8VandaC,asshownin Table 1 .

4- Value: Is a measure of data usefulnesswhen it comes to de-cision making [19] , or how much added-value is brought by thecollected datato theintended process, activity,or predic-tiveanalysis/hypothesis [21] .

5- Veracity: Refers tothe authenticity andtrustworthiness ofthe collected data against unauthorized access and manipulation [21,22] .

6- Volatility:Anindicationoftheperiodinwhichthedatacanstill beregardedasvalidandforhowlongthatdatashouldbekept andstored [23] .

7- Validity:Thismightappearsimilartoveracity;however,the dif-ferenceisthatvaliditydealswithdataaccuracyandcorrectness regardingtheintendedusage.Thus,certaindatamightbevalid foranapplicationbutinvalidforanother.

8- Variability:Thisrefers to theinconsistencyof thedata.Thisis duetothehighnumberofdistributedautonomousdatasources [24] . Other researchers refer to the variability as the consis-tencyofthedataovertime [22] .

9- Complexity: A measure of the degree of interdependence and inter-connectednessinbig data [20] .Such that, asystem may witnessa (substantial,low, or no) effect due to a very small change(s)that ripplesacrossthesystem [19] .Also, complexity canbeconsideredintermsofrelationship,correlationand con-nectivity ofdata. It can further manifestin terms ofmultiple data linkages, and hierarchies. Complexity and its mentioned attributescan howeverhelpbetterorganizebigdata.Itshould benotedthat complexitywasincludedamongthebigdata at-tributes(Vs)in [20] wherebigdatawascharacterizedashaving 6V+complexity.Thisishowwewillarrangeitin Table 1 .

Theprocess ofextractinghidden, valuablepatterns,anduseful information from big data is called big data analytics [44] . This is done through applying advanced analyticstechniques on large data sets [28] . Before commencing the analytics process, data sets may comprisecertain consistencyand redundancyproblems

Fig.1. HadoopV1.xarchitecture.

affecting their quality. These problems arise due to the diverse sources fromwhich thedata originated.Data pre-processing tech-niquesareusedtoaddresstheseproblems.Thetechniquesinclude integration, cleansing (or cleaning), and redundancy elimination, andtheywerediscussedbytheauthorsin [39] .

Bigdataanalyticscanbe carriedoutusinganumberof frame-works(shownbelow) that usually requirean upgradeablecluster dedicatedsolely forthat purpose [17] . Evenifthe cluster canbe formedusinga numberofcommodity servers [45] ,however,this still forms an impediment forlimited-budget users who want to analyzetheirdata.Thesolutionispresentedthroughthe democra-tizationofcomputing.Thismadeitpossibleforany-sizedcompany andbusiness ownerstoanalyzetheir datausingcloudcomputing platformsforbigdataanalytics.Consequently,theuseofbigdata analyticsisnotlimitedtoenterprise-levelcompanies.Furthermore, business owners do not have to heavily invest in an expensive hardwarededicated toanalyzingtheir data [1] .Amazonisone of thecompaniesthatprovide‘cloud-computed’bigdataanalyticsfor its customers.The serviceiscalled AmazonEMR (Elastic MapRe-duce),anditenablesuserstoprocesstheirdatainthecloudwith a considerably lower cost in a pay-as-you-use fashion. The user isable toshrink orexpand thesize ofthe computingclustersto controlthedatavolumehandledandresponsetime [1,46]

(3)

Fig.2. HadoopV2.xarchitecture.

HadoopV1.x(shownin Fig. 1 )consistsoftwoparts:theHadoop DistributedFileSystem(HDFS)thatconsistsofastoragepart,and adataprocessingandmanagement(MapReduce)part.Themaster nodehastwoprocesses,aJobTrackerthatmanagestheprocessing tasksandaNameNodethatmanagesthestoragetasks [50] .

When a Job Tracker takes job requests, it splits the accepted job into tasks and pushes them to the Task Trackers located in theslave nodes [51] . The Name Noderesembles the master part, whiletheDataNodesrepresenttheslavepart [12] .There ismore explanationintheHDFSpartbelow.

Manyprojectsweredevelopedinaquesttoeithercomplement or replace the above parts, and not all projects are hosted by the Apache Software Foundation, which is the reason for the emergenceofthetermHadoopecosystem [47] .

Hadoop V2.x isviewedasa three-layered model.Theselayers areclassiﬁedasstorage,processing,andmanagement,asshownin Fig. 2 .ThecurrentHadoopprojecthasfourcomponents(modules), whichareMapReduce,theHDFS,YetAnotherResourceNegotiator (YARN),andCommonutilities [17] .

1- MapReduce:Asaprogrammingmodel,MapReduceisusedasa data processing engine and for cluster resource management. WiththeemergenceofHadoopv2.0,theresourcemanagement taskbecameYARN’sresponsibility [17] .WordCountisan exam-ple illustrating how MapReduce works. As the name implies, it calculates the number oftimes a speciﬁc word is repeated within a document. Tuples

w, 1

are produced by the map function, where w and 1 represents the word and the times it appearedinthedocument respectively.Thereduce function groupsthetuplesthatsharethesamewordandsumstheir oc-currencestoreachtheconcludingresult [61] .

2- HDFS: HDFS represents the storage ﬁle-system component in the Hadoop ecosystem. Its main feature is to store huge amounts of data over multiple nodes and stream those data setstouserapplicationsathighbandwidth.Largeﬁlesaresplit into smaller 128MB blocks, with three copies of each block of data to achieve fault tolerance in the case of disk failure [17,52,53] .

3- YARN:YARNwasintroducedinHadoopversion2.0,andit sim-ply took over the tasks of cluster resource management from MapReduce and separated it from the programming model, thus making a more generalized Hadoop capable ofselecting programming models, like Spark [54] , Storm [55] , and Dryad [56,57] .

4- Commonutilities:TooperateHadoop’ssub-projectsormodules, asetofcommonutilitiesorcomponentsareneeded.Shared li-brariessupportoperationslikeerrordetection,Java implemen-tationforcompressioncodes,andI/Outilities [17,58] .

Over the last few years, researchers in telecommunication networks started to consider big data analytics in their design toolbox. Characterized by hundreds of tunable parameters, wire-less network design informed by big data analytics received mostofthe attention, however,other typesof networksreceived increasingattentionaswell.

The vast amount of data that can be collected from the networks, along with the distributed modern high-performance computingplatforms, canlead tonewcost-effective designspace (e.g. reducing total cost of ownership by employing dynamic VirtualNetworkTopologyadaptation)whencomparedto classical approaches(i.e. staticVirtualNetworkTopologies) [59] .Thisnew paradigm is promising to convert networks from being sight-less tubes for data into insightful context-aware networks. Our contributionsinthispaperareasfollows:

1- Weshow inthis papertherole bigdata analyticscanplay in wirelessandwirednetworkdesign.

2- Theabove roleiscorroborated throughthe illustrationofcase studiesin Section 2 .

3- The signiﬁcance of this paper lies in helping academic re-searcherssave mucheffort byunderstanding the state-of-the-artandidentifyingtheopportunities,aswellasthechallenges facingtheuseofbigdataanalyticsinnetworkdesign.

4- In addition to academic approaches, we surveyed network equipmentmanufacturing companies highlightingnetwork so-lutionsbasedonbigdataanalytics.Wealsoidentiﬁedthe com-monareasofinterestamongthesesolutions,andthusthis sur-veycanbeneﬁtbothacademicandindustrial-orientedreaders. 5- Thispaperprovidedinsightsonpotentialresearchdirectionsas

illustratedin Section 8 .

This paper is organized as follows: Section 2 presents sev-eral case studies uses big data analytics in wireless and wired networks. Sections 3 –6 illustrate the research conducted in the directionof employingbigdata analyticsinthe ﬁeldsof cellular, SDN & intra-data center, optical networks, and network secu-rity, respectively. Section 7 summarizes some of the main big data-based network solutions offered by industry. Section 8 dis-cusses the network design cyclebased onbig data analyticsand highlights the challenges encountered in big data-powered net-work design. In Section 9 we propose open directions forfuture research.Finally,thepaperendswithconclusionsin Section 10 .

2. Casestudiesoftheuseofbigdataanalyticsforwirelessand wirednetworks

2.1. Detectionofsleepingcellsin5GSON

(4)

(reference scenario) to −50 dBi (SC scenario). Measurements re-portedfromUEsarethencollectedfromeachscenarioandstored in a database. The reference scenario provided measurements used by an anomaly detectionmodel that is basedon k-nearest-neighbor algorithm to provide a network model with normal behavior. Multidimensional Scaling (MDS) is used to produce a minimalisticKeyPerformanceIndex(KPI)representation.Thusthe interrelationship between Performance Indexes (PIs) is reﬂected and an embedded space is constructed. Consequently, similar measurements (i.e. normal network behavior) lie within close distances whiledissimilar measurements (i.e.anomalous network behavior)arefar-scatteredandhenceeasily identiﬁed.Themodel attained94percentdetectionaccuracywith7minstrainingtime.

2.2. AproposedarchitectureforfullyautomatedMNOreporting system

MobileNetworkOperators(MNOs)collectvastamountsofdata fromanumberofsourcesasitcanofferactionableplansinterms of service optimization. Visibility and availability of information is vital for MNOs due to its role in decision making. Employing a reporting system is pivotal in the cycle of transforming data to information, knowledge, and lastly to actionable plans. The authors in [61] presented a case study aimed at illustrating the potential roleofbigdataanalyticsinthedevelopmentafully au-tomatedreportingsystem.AMoroccanMNOistobeneﬁtfromthe alternativearchitecture. Theauthorshighlightedtheshortcomings of the existing automatic reporting system that uses traditional technologies.Moreover,they inferredthatusingbigdataanalytics canprovidetheopportunitytoovercomethoseshortcomings.

The authors chose the Apache Flink [61] in their proposed architecturetoserveastheirbigdataanalyticsframework.Several reasons contributed towards this choice, including the Apache Flink’s ability to process data in both stream and batch modes, ease of deployment, andfast executionwhen compared to other frameworks such asSpark. Furthermore,the apacheFlink can be integratedwithotherprojectslikeHDFSfordatastoragepurposes. Moreover, Apache Flink is scalable which makes it an optimal choiceforthissystem.

2.3. NetworkanomalydetectionusingNetFlowdata

Big data analytics can support the efforts in the subject of network anomaly andintrusion detection.Towards that end, the authorsin [62] proposedanunsupervisednetworkanomaly detec-tionmethodpoweredbyApacheSparkclusterinAzureHDInsight. The proposed solution uses a network protocol called NetFlow that collects traffic information that can be utilized for the de-tection of network anomalies. The procedure starts by dividing the NetFlows dataembedded inthe raw datastream into 1min intervals. NetFlows are then aggregated according to the source IP, anddatastandardizationis carriedout.Afterwards, ak-means algorithmisemployedtocluster(accordingtonormalorabnormal traffic behavior)theaggregatedNetFlows.Thefollowingstepisto calculate the Euclidean distance between the cluster center and its elements. The procedure concludes by evaluating the success criteria. The authors considered a dataset containing 4.75 h of recordscapturedfromCTUUniversitytoanalyzebotnettraffic.The proposed approach attained 96% accuracy and the results were visualized in 3D after employing Principal Component Analysis (PCA)toattaindimensionreduction.

3. Roleofbigdataanalyticsincellularnetworkdesign

In this section, we review the research done on the use of big data analytics for the design of cellular networks. Compared

to other network design topics, we observed that the wireless field has received the most attention, as measured by its share of research papers. These papers can be classified according to theapplicationorareaunderinvestigation.Consequently,wehave classifiedthosepapersintothefollowing:

1-Counter-failure-related:Thisincludesfaulttolerance(i.e. detec-tionandcorrection),prediction,andpreventiontechniquesthat usebigdataanalyticsincellularnetworks.

2-Networkmonitoring:Thisillustrateshowbigdataanalyticscan be beneﬁcialasalarge-scaletool fordatatraﬃcmonitoringin cellularnetworks.

3-Cache-related: Investigateshowbigdataanalyticscanbe used for content delivery, cache node placement and distribution, location-speciﬁccontentcaching,andproactivecaching. 4-Network optimization: Big data analytics can be involved in

severaltopicsincludingpredictivewirelessresource allocation, interferenceavoidance,optimizingthenetworkinlightof Qual-ityofExperience(QoE),andﬂexiblenetworkplanninginlight ofconsumptionprediction.

Itshould be notedthat Table 2 provides furtherdetailed clas-siﬁcation,withthechancetocomparetherole playedbybigdata analyticsacrossdifferentnetworktypesandapplications.

3.1. Failureprediction,detection,recovery,andprevention

3.1.1. Inter-technologyfailedhandoveranalysisusingbigdata Oneofthemostfrustratingencountershappenswhenamobile subscribergetssurprisedbyasuddencalldrop.Manyofthese in-cidentsoccurwhentheuserisattheedgeofacoverageareaand moving towards another,technologically-different area, e.g., mov-ingfroma3GBaseStation(BS)toa2GBS.Thecommonsolutions toaddresssuchshortcomingsareby eitherconductingdrivetests orperformingnetworksimulation.However,anothersolutionthat leverages the powerof big data wasproposed by the authors in [63] .The proposed solutionusesbig dataanalytics(Hadoop plat-form)toanalyzetheBaseStationSystemApplication Part(BSSAP) messages exchanged between the Base Station Subsystem (BSS) andMobileSwitchingCenter(MSC) nodes.Locationupdates(only those involved in the inter-technology handover) are identiﬁed andthegeographiclocationswherethe3G-servicedisconnections occurareidentiﬁedbyrelyingontheprovidedtargetCellID.

The results of the above method were then compared with a drive test (which is an expensive and time-consuming ap-proach) results, where coherence between the two results was demonstrated. Another comparison was conducted with the Key Performance Index(KPI)-based approach and the resultswere in favoroftheproposedapproach.

3.1.2. Signalingdata-basedintelligentLTEnetworkoptimization Byutilizing the combination of all around signalingand user and wireless environment data, combined with Self-Organized Network technologies (SON), full-scale automatic network opti-mizationcouldberealized.

The authors of [27] developed an intelligent cellular net-work optimizationplatform based onsignaling data.This system involvesthreemainstages:

(5)

Table2

Researchsummary.

Networktype Researchcategory Reference Proposedordeployedtechnique Wireless Failureprediction,Detection,

Recovery,andprevention

[63] Analyzedinter-technology(2G-3G)failedhandovers.

[27] UsedXDRdatatodiscovernetworkfailuresandpresentasolutionadvice.

[64] DevelopedCADMwhichusesCDRstoidentifyanomaloussites.

[65] Presentedthreecasestudiesofself-healingusingbigdataanalytics.

[68] Suggestedtheanalysisofthebandwidthtrendstopredictequipmentfailure. Networkmonitoring [69] DevelopedaHadoop-basedsystemtomonitorandanalyzenetworktraﬃc.

[70] Developedasolutionpoweredbybigdataplatformswithdistributedstorage anddistributeddatabasetosolvetheissuesofdataanalysisandacquisition. Cacheandcontentdelivery [26] Utilizedbigdatatoformaclustermadeupofnearbyusersthatsharethe

basestation’swirelesschannel.

[72] Analyzedthedatathatresideswithinthecachenodestoenhancethe determination,allocation,anddistributionofcachenodes.

[68] Suggestedmonitoringandanalyzingsocialmediaandpopularsites,topredict andcachecertaincontents,accordingtoagecategory,atthepredicted locationswherethesecontentsarehighlydemanded.

[34] Proposedtheuseofbigdataanalyticsandmachinelearningtechniquesto proactivelycachepopularcontentin5Gnetworks.

Networkoptimization [73] Presentedthreecasestudiesinwhichaproposednetworkoptimization frameworkiseﬃcientlyutilized.Inparticular,theworksuggested: (1)TheuseofbigdataanalyticstomanageresourcesinHetNets.Thisisdone

inthreestages(networkplanning,resourceallocation,andinterference coordination).

(2)ThedeploymentofcacheserversinmobileCDN. (3)TheoptimizationofnetworkswithQoEinmind.

[76] ProposedNCLself-conﬁguration/optimizationalgorithmstoachievean automatic,self-optimizedhandover.TheworkreliedontheprocessingofCM andPMKPIsusingbigdataanalyticsplatform.

[77] Developedathree-stageframeworkthatutilizesthenetworkanduserKPIsto reachanoptimalallocationofradioresources(PRBs).

[60] Presentedaframeworkthatusesbigdatacollectedfromthecellularnetwork toempowerSON.Theyalsopresentedacasestudyonhowtodetect sleepingcellsusingthisframework.

[27] Investigatedtheimpactofbigdataon5Gnetworksintermsof: (1)Eﬃcientcontentprovisioning.

(2)Flexibilityinfunctionalityandnetworkdeployment. (3)Utilizinguserbehaviorinwirelessresourceoptimization. (4)Achievinghighlyeﬃcientnetworkoperation.

(5)SavingenergyinHetNetorMulti-RATnetworks.

[68] Correlatedlocationdata,serviceusage,andothercontextualdatatopredict theconsumptiontrendsandselecttheoptimalnodelocation.

SDNandInter-DataCenter Traﬃcprediction [85] Dynamicallocationofnetworkresourcesbyrelyingontraﬃcpredictedby employingHadoopplatform.

[86] DevelopedPythia,asystemthatusesHadoop’spropertiestopredictthe volumeofdatacommunicationatruntimeindatacenternetworks. Traﬃcreduction [87] ProposedCamdoopandrunitoverCamCube,theperformancesurpassedthat

ofCamdooprunningonaconventionalswitch. Optical Networkoptimization [91] UsedHadooptoﬁndasolutionfortheRWAproblem.

[59] PredictedfuturetraﬃcbyusingBigDataAnalyticstoreconﬁgureVNT regularly.

Traﬃcprediction [93] Employed Hadoop MapReduce to solve multiple optimization problems of bin-packingnature.

Security [100] Proposed a threat detection framework to detect peer-to-peer Botnet attacks.

[104] Developed B-dIDS, a system that scans IDS log ﬁles to detect multi-pronged attacksindistributednetworks.

[106] Surveyedthetopicofwirelessdeviceﬁngerprintingmethodsandhowmachine learningalgorithmscanbeusedfordeviceidentiﬁcation.

[107] Providedasurveydiscussingtheroleofdataminingandmachinelearning algorithmsinintrusiondetectionmethods.

2- Problemdiscovery:Serviceestablishmentrate,thehandover suc-cessrate,anddroprateareamongthenetworksignaling-plane statusesthatcanbereﬂectedbytheXDR-basednetwork perfor-manceindicators.Networkequipmentwithunsatisfactory per-formance indicators can be further analyzed, andthis can be done by conductinga furtherexcavation ofthe corresponding indicators’originalsignaling.

3- Providingbestpracticesolutions:Identifiedandsolved problems can provide an optimization experience. As a consequence, a varietyofnetworkproblemscanbeverified.Forexample,when acellhasalowhandoversuccessrate,accordingtothe defini-tionoftheassociatedindicators,thereasonissuggestedtobe

thelowsuccessrateofthehandoverpreparation.Thesolution wouldbetoadjusttheoverlapping coverageareasformed be-tweenthesourceandthetargetcellsandtheparameters(e.g., thedecisionthresholdoffsetandthehandoverinitiation).

Arecommendedsolutioncanbeprovidedwhenadeteriorating indicator surfaces,and this is simply done by clicking the index querythatcausedthedeterioration.

3.1.3. Anomalydetectionincellularnetworks

(6)

identified by examiningtheCallDetailRecord(CDR) oftheusers inaspecificarea.CDRfilesaregenerateduponmakingacall,and include,among other information,thecaller andcallednumbers, thecallduration,thecallerlocation,andthecellIDwherethecall wasinitiatedorreceived.

ACDRbasedAnomalyDetectionMethod(CADM)wasproposed by the authorsin [64] . CADMwasused to detectthe anomalous behaviorofusermovementsinacellularnetwork.Thiswasdone, ﬁrst, withthe CDR data beingcollectedfrom the networknodes and stored in a mediation department. Then, the second phase starts by distributing the collected CDRs to the relevant depart-ments (e.g., data warehouse, billing, and charging departments). After that, theHadoop platform is used todetect the anomalies. The discovered anomalies are then fed-back to the mediation departmentforadequateactions.

The use ofbig dataanalytics wasessential inthis case. Large datasets thatrequiredistributedprocessingacrosscomputer clus-ters were processed by the Hadoop Platform. The result was an improved system that is able to detect location based anomalies andimprovethecellularsystem’sperformance.

3.1.4. Self-healingincellularnetworks

The idea to develop a system that is capable of monitoring itself, detecting the faults, performing diagnoses, issuing a com-pensationprocedure,andconductingarecoveryisveryappealing. However, the self-healing process has another factor to keep in mind, whichis time. The process shouldbe carried out within a reasonableamountoftime soitwouldnotdegradethequalityof thedeliveredservices.

Three use cases were presented by the authors in [65] for a self-healingprocessincellularnetworks:

1- Data reduction: The Operation and Maintenance (O&M) database can be used for troubleshooting purposes. How-ever,thedatabasesizeisrelativelylargeasitcontainsthedata related to both normal and degraded intervals, which makes itdiﬃcultto process.Separatingtheintervalstojustkeepthe degradedintervalswillhelp inreducingthat size.Theauthors proposedparallelizing thisprocess independentlybyanalyzing eachBSseparately.

Theychosethedegradedintervaldetectionalgorithmof [66] (a degraded intervalisthetime wheretheBSbehaviorisdegraded), andtheseintervalsweredetectedbycomparingtheBS’sKPIstoa certainthreshold.Thisalgorithmwasparallelizedbyimplementing it asamap function, aﬁeld isadded toidentify eachBS, andall theﬁeldsareaddedbyareducefunction.

2- Detectingsleeping cells: Celloutage orsleepingcellsis a com-monprobleminmobilenetworks.Usersaredirected to neigh-boringcellsinsteadof thenearest andoptimalcell.According to the algorithm described in [67] , sleeping cells can be de-tectedthroughtheutilizationofneighboringBSmeasurements hence calculating the impact of the sleeping cell outage. The detectionprocess relies on theResource Output Period (ROP), whereeachBSproducesConﬁgurationManagement(CM),Fault Management (FM), and Performance Management (PM) data every15min.ForeachBS,incominghandoversfrom neighbor-ingBSsareaggregatedforthecurrentandpreviousROP.Ifthe numberofhandoverssuddenlydroppedtozero,anda malfunc-tionisindicated bythe cell’sPerformanceIndicators (PIs),the cellisregardedasasleepingcell.

The authorsin [65] proposed the useofthe above-mentioned algorithm under the big data principle. They proposed to divide theterrainintopartitionsthatarethemaximumdistancebetween neighbors, whereeach BSwithin the partitioned area is

sequen-tiallytestedby an instance ofthealgorithm, andthisisdone by examiningthedataofitsneighbors.

This approach was compared to other methods (e.g., lack of KPIsandavailability of KPIs),andmost ofthe simulatedoutages weredetected (5.9%falsenegativesand0% falsepositives).While a lack of KPIs and availability of KPIs methodologies showed a highpercentageoffalsenegatives.

3-KPI Correlation-based diagnosis: The authors in [65] used a methodthatutilizesmostcorrelatedKPIstoidentifythe prob-lemcause.Tosimplifytheanalysistask,thealgorithmconsiders thePIsofboththeaffectedBSandtheneighboringsectors.

MapReduce wasusedto implement thisalgorithm ina paral-lelizedmanner, the correlation process andthe creation of a PIs listarrangedbycorrelationwereimplementedasmapandreduce functions,respectively.

3.1.5. Cellsiteequipmentfailureprediction

Asuddenoutage ofservicesmighthaveserious consequences, andthisiswhykeepingcommunicationequipment,likecellsites, inagoodworkingstateisofhighimportance.Thechallenge iden-tiﬁedbytheauthorsin [68] istoanalyzetheuser’sbandwidthon thecelllevel.Equipment(s)failureandinfrastructurefaults canbe predictedbyanalyzingthebandwidthtrendsinaparticularcell.

Duetothesizeanddiversityofthecollecteddata,itis essen-tial to use big data analytics to process it. Thus, the customers’ receivedbandwidthcanbe acquiredovera particulartime period (i.e.,monthoryear,etc.).Nextthedatafromdiversedatasources areintegratedandthenanalyzedtoknowthebandwidthtrends.

3.2.Networkmonitoring

3.2.1. Large-scalecellularnetworktraﬃcmonitoringandanalysis Large cellular networks have relatively high data rate links and high requirements to meet. Usually these networks use a high-performance and large capacity server to perform traﬃc monitoringandanalysis.

However, with the continuous expansion in data rates, data volumes,andtherequirementsfordetailedanalysis,thisapproach seemstohavealimitedscalability.Hence,theauthorsof [69] pro-posed a system to undertake that task, utilizing the Hadoop MapReduce, HDFS, and HBase (a distributed storage system that manages the storage of structured data and stores them in a key/value pair) as an advanced distributed computing platform. They exploited its capability of dealingwith large data volumes while operating on commodity hardware. The proposed system wasdeployedin the coreside ofa commercial cellular network, and it was capable of handling 4.2TB of data per day supplied through123Gbpslinkswithlowcostandhighperformance.

3.2.2. Mobileinternetbigdataoperator

China Unicom, China’s Largest WCDMA 3G mobile operator with 250 million subscribers in 2012, introduced an industry ecosystem. The researchers in [70] highlighted this asa telecom operator-centricecosystemthatisbasedonabigdataplatform.

(7)

ComparedwiththeOracledatabase,itisnotedthatthesystem achieved a four times lower insertion rate. The query rate was compared to an Oracle database as well, and the HBase showed a better performance when takinginto consideration the impact imposedbytherecords’size.

3.3.Cacheandcontentdelivery

3.3.1. Optimizedbandwidthallocationforcontentdelivery

Mobile networks, usually, have a large number of users, and with the increase in Internet-based applications, it has become essential to allocatethe required bandwidth that meets the user expectations,as well as to ensure a competitive level of service quality.CellularnetworkscanprovideInternetconnectivitytotheir users at any time; however, video (especially high quality) con-tentsarestillslowandrelativelyexpensive.Fromthebasestation’s pointofview,theimpactofforwardingthesamevideocontentto severalusersonthesamebasestationismassive.TheLTEsystem addressedthisthroughmulticasttechniques.However,multicastis stillregardedasabigchallengeincellularnetworks.Toovercome theabove problem, the authors of [26] proposed a solution that candynamicallyallocatebandwidth.Theideaisbasedonsharing thebase station’s wireless channel by a usercluster that wishes todownloadthecontents.Thus,savingthebasestationresources, as well as providing a better data rate for the clustered users, andprovidingan opportunity forthe userswho did not join the clustertobeneﬁtfromthesavedresources(bandwidth).Itshould be noted that the clusteredusers can receive the contents from thecluster head by usingshort rangecommunication techniques likeWi-FiDirect [71] andDevicetoDevice(D2D).

Two conditions have to be satisﬁed before forming a user cluster. First, the users who request the same content are the oneswhoformthecluster.Second,theusersshouldbeorwillbe within a short range of each other. Forthat reason, the authors suggestedusing bigdataanalyticstoidentifytheusers’ closeness and to group the users into cluster(s). A cluster head is then selected among the nearby users, and the process is repeated among the base station users until there is either a cluster of usersora free (un-clustered)user(s).The simulation wascarried for a single base station network and the results showed faster contentdeliveryandimprovedthroughputattheuserlevel.

3.3.2. Improvecachenodedetermination,allocation,anddistribution accuracyincognitiveradionetworks

In cognitive radio networks, Secondary Users (SU) have to leave the licensed spectrum when their activity starts to affect theQoSlevel ofthelicensed users.This movewouldrequire the existenceofacache nodeto compensatefortheinterrupted data transactionsduringtheSUswitchtotheunlicensedspectrum.

The author of [72] proposed the use of big data analytics to process the data accumulated over time within the nodes. The goalwastoutilizethisdatatoreachadecisiononthecachenode distributioninaclusternetwork.

The author selected two out of three categories (open and selectively open systems) of cognitive radio networks. Due to the nature of the open systems, every SU willingly shares its informationto be processed, which resultsin a large amount of data,sothepredictionaccuracyishigh.

For the selectivelyopen systems,the SU selectivelyshares its informationwitheithersome cachenodes,withthe cluster head for a particular time interval, or with speciﬁc SUs in a cluster. Thisresultsinavariableamountofshareddata,thusresultingin variableaccuracy.

3.3.3. Trackingandcachingpopulardata

The number of social network (i.e., Facebook and Twitter) users ismassive. The multimedia contents ofthesenetworks are normally shared between commoninterest groups. However, big andimportant events attract a lotof attentionand consequently a lotofcontent is sharedacross thesenetworks. When a certain video oreventgoesviral, thissharing willeventually burdenthe networkasthe requested contentwouldhaveto travelalong the networkonitswaytotheservers.Thesolutiontosuchaproblem wassuggestedby the authorsof [68] , they suggestedmonitoring popularandsocialmedia websites,analyzingthedata,identifying if there is a growing interest in certain content, by which age category, andcachingthe populardataforaspeciﬁcbasestation. Big data analytics can be of major use in this situation by em-ployingittodotherequiredanalysis.The resultwouldbecached content available to theusers faster (reducedprovisioning delay) andwithoutburdeningthenetwork.

3.3.4. Proactivecachingin5Gnetworks

Cache-enabledbasestationscanservecellularsubscribers,this isdonebypredictingthemoststrategiccontentsandstoringthem intheircache.Thus,minimizing boththeamountoftimeandthe consumed network bandwidth, which can payoff in other ways (i.e.,lesscongestionandlessresourceutilization).

An approach, proposed by the authors in [34] , used big data analytics and machine learning to develop a proactive caching mechanism by predicting the popularity distribution of the con-tentin5Gcellularnetworks.Theydemonstratedthatthisapproach can achieve eﬃcient utilization of network resources (backhaul oﬄoading)andanenhanceduserexperience.

After collecting therawdata,i.e., theusertraffic,the bigdata platform(Hadoop)hasthetaskofpredictingtheuserdemandsby extracting the useful information,like Location Area Code (LAC), Hyper Text Transfer Protocol (HTTP) request-Uniform Resource Identifier(URI), TunnelEndpoint Identifier(TEID)-DATA,andTEID forcontrolanddataplanes.Thenusingthisinformationto evalu-atethecontentpopularityfromthepreviouslycollectedrawdata. Experimentallytestingthisworkon16basestations,aspartofan operationalcellularnetwork, resultedin100%requestsatisfaction and98%backhauloffloading.

3.4. Networkoptimization

3.4.1. Bigdata-drivenmobilenetworkoptimizationframework Whenthinkingaboutoptimizingacellularnetwork,itis impor-tanttocollectasmuchinformationaspossible.Largenetworks,as wellastheirusers,generateaplethoraofdata,forwhichtheuse ofbigdataanalyticsisvitaltoanalyzethecolossalamount.

The authors in [73] proposed a mobile network optimization frameworkthatisBigDataDriven(BDD).Thisframeworkincludes severalstages, startingfromthe collection ofbig data,managing storage,performingdataanalytics,andthelaststageoftheprocess isthenetworkoptimization.

Three case studies were used to show that the proposed frameworkcouldbeusedformobilenetworkoptimization.

1- ManagingresourcesinHetNets

The Mobile Network Operators (MNOs) may use big data to provide real time and history analysis across users, mobile networks, and service providers. MNOs can beneﬁt from BDD approachesintheoperationanddeploymentoftheirnetwork,and thiscanbedoneinseveralstages:

(8)

amount of information (user and network)is provided for analyses.BigdataanalyticscanhelpMNOsreachbetter de-cisions concerning the deployment of eNB in the mobile network. Theauthors in [73] suggestedthe useofthe net-workandanonymoususers’data(e.g.,dynamicposition in-formation and other service features). Providing a relation between the data and their events can offer a better un-derstanding of the traﬃc trends. Big data sets provide ac-tionable knowledge to reach an optimal decision concern-inghowandwheretodeployeNBsinthenetwork.Another importantfeatureistheabilitytoprepareforfuture invest-mentsdependingonthepredictedtraﬃctrends.

(B) Predictive resourceallocation: Resource requirementschange dependingonthedensityandusagepatternsofmobile net-work subscribers.Predicting whereandwhenmobile users areusingthenetworkcanhelpinpreparingforsudden sig-nificanttrafficfluctuations.Theauthorsin [73] suggestedthe use of big dataanalytics to examine behavioral and senti-mentdatafromsocialnetworksandothersources.Theyalso showedaninterestinutilizingcurrentandhistoricaldatato predictthetrafficinhighlypopulatedareaswithinthe net-work.

UsingthecloudRANarchitecture [74] ,therightplaceatthe righttimecanbeservedthroughthepredictiveresource al-location,thusminimalservicedisruptioncanbeachieved.

(C) Interference coordination: HetNets with small cells can be usedtoconductinterferencecoordinationamongmacroand small cells. This coordination has to be carried out in the timedomaininsteadofthefrequencydomain.Schemeslike the enhanced Inter-Cell Interference Coordination (eICIC) in LTE-Advanced [75] efficiently enable resource allocation among interferingcells, as well asimproving theinter-cell load balancing in the HetNets. eICIC allows Macro cells evolvedNodeB(MeNB)anditsneighboringSmallcelleNBs (SeNBs) to have data transmitted in isolated subframes, thus interference from MeNB to SeNB can be avoided. To implement eICIC, a special type of subframe named an Almost Blank Subframe (ABS) that carries minimum (and mostessential)controlinformation,wasdefined.Itisworth notingthattheABSsubframesaretransmittedwithreduced power [75] , andthat thenetwork operator cancontrol the configurationofthatsubframe.

Many factors contribute to the determination of the ABS ratioofthe macrocellto thesmallcell, such asthetraﬃc load in a speciﬁc area, the service type, and so on. The optimalABSratiovariesdynamically, andthisisduetothe fact that inter-cell interference changes with time for the factorsmentionedabove.

In a BDD system, optimizing the radio resource allocation can beaccomplished throughthe useofnetwork analytics. ThedeploymentofBDDoptimizationfunctionsattheMeNB would enable them to collect and analyze eNB-originated rawbigdata(e.g.,servicecharacteristicsandtraﬃcfeatures) inreal-time,thusenablingaquickresponse.Asaresult,the performance optimizationofeachcellandtheuserscanbe fulﬁlled.

OptimizingICICparameters(e.g.,ABSratio)canbeachieved by processing raw data in a periodic manner to acquire statisticsandtodetecttraﬃcvariationsautomatically.

2- DeploymentofcacheserverinmobileCDN

Popular content (e.g., movies) can be delivered through a ContentDelivery Network(CDN), whichis a methodthat is con-sideredeﬃcientbymanyMNOs.Distributedcacheservers should be locatedneartheuserstoachieve afastresponseaswell asto

reduce the delivery cost. In hierarchical CDN, it is vital to place cache servers in an optimallocation. Due to the unique features thatRANhas,itwastheprimaryinterestoftheauthorsin [73] .

It is expected that there will be an enhanced backhaul ca-pability in 5G networks, and this would result in minimizing the concerns related to the latency and traﬃc load of backhaul transmissions.Therefore,notallMeNBswouldrequireadedicated distributedcacheserver.Inaddition,aSeNBcanhaveadistributed cacheserver.

Optimal cache server placement depends on several factors, suchasthefeaturesandloadoftraﬃcinagivenarea,aswellas the cost ofstorage andstreaming equipment.To help the MNOs decidewheretodeploytheircacheservers,dataanalyticsmethods canberegardedasafeasiblesolution.However,thiswouldrequire the collection of all the above-mentioned factors over a long periodintherelatedcoveragearea.

3-QoEmodellingforthesupportofnetworkoptimization

Theauthors of [73] believedthat themanagement of services and applications needed more than just relying on the QoS pa-rameters.Instead,they suggestedtakingthe quality(i.e.,QoE), as perceived by the end users, to be regarded as the optimization objective. Accurate and automatic real time QoE estimation is important to realize the optimization objective. In addition to the technical factors, non-technical factors (e.g., user emotions, habit,andexpectations,etc.)canaffecttheQoE.Aproﬁleforeach particularuser comprisingthe above non-technicalfactors would helpintheQoEevaluation.

Since answering the questions that would lead to a clear profileis not a taskthat would be fanciedby a typical user, the authors suggested installing a profile collection engine on the users’ mobile devices. User activities are compared and tracked torecognizedifferencesandsimilarities, andthenthey arestored inadatabaseforadditionalprocessing.After profiling,the follow-ing step constitutes the use of machine learning to identify the relationshipbetweenQoEandtheinfluencingfactors.

Data analyticscan be used to discover what impacts the QoE in users’devices, as well as the services andnetwork resources. The next step is for network optimization functions to react to determining what caused the problem and select the optimal actionaccordingly.

3.4.2. ImproveQoSincellularnetworksthroughself-conﬁguredcells andself-optimizedhandover

Cellularnetworkshaveacrucialelementonwhichtheconcept of mobility depends. This element is the handover success rate, whichensurescallcontinuitywhiletheusermovesfromone cell to another. Failing in that particular element would impact the qualityoftheservice,thusputtingtheoperatorintoaquestionable situation.

Operatorstrytomakesurethateachcellhasalistofmanually conﬁguredandoptimizedneighborcells.Hence, itisvitaltonote the highprobability of thesecells failing to adapt when a rapid responseisrequiredduetoasuddennetworkchange.

The authors in [76] presented two methods that used big data analytics to introduce a self-configured and self-optimized handoverprocess,the firstwasassociated withnewly introduced cells, while the latter was concerned with the already existing cells. The analysis started by collecting and archiving predefined handover KPIs. A dispatcher process is run after the collection period,anditsaimistocheckthefilestoseeiftheyweremarked asnewcells(where Self-ConfigurationAnalytics isstarted)ornot (whereSelf-OptimizationAnalyticsisstarted):

(9)

Newly installed basestations require Neighbor CellList (NCL) tobeconfiguredonthenewcells.Theselectionprocesstakesinto considerationtheantennatype,theazimuthangle(fordirectional cells), the geographic location of the candidate cells, and the processconcludesbyselectingcellswithaminimumdistanceand maximum traffic load to be the top candidate cells. The NCL is configuredviaNetworkManagement System(NMS)Configuration Management(CM)tools.

2- NCLself-optimizationforexistingcells

The processstartsby collectingKPI measurementstatistics for thefailedandsuccessful handovers,andthistaskis done by the PerformanceManagement(PM)ortheNMS.

Cellswithahandoverfailureratebelowapredeﬁnedthreshold areexcluded fromtheNCL,while unlistedneighboringcellswith a successfulrate above a predeﬁned threshold are considered as newneighboringcells.

3.4.3. OptimizingtheresourceallocationinLTE-A/5Gnetworks The overall systemperformance evaluation inadvanced wire-lesssystems,likeLTE,dependsonKPIs. Inaquesttoenhance the user experience, the authors of [77] proposed an approach that utilizesuserandnetworkdata,suchasconfigurationandlogfiles, alarms,anddatabaseentries/updates.Thisapproach reliesonthe useofbigdataanalyticstoprocesstheabove-mentioneddata.The ultimategoal istoprovidean optimalsolutionto theproblemof allocatingradioresources toRAN users,andguarantee aminimal latencybetweenrequesting theresource andallocatingit.This is donethroughuserandnetworkbehavioridentification,whichisa taskwell-matchedforbigdataanalytics.

Theproposedframeworkinvolvesthreestages:

First stage: This process is carried out in the eNB system, processing the data from the cellular and core network side. Binary values are acquired by comparing cellular level KPIs to their respectivethreshold values,thus keeping thebinary matrix updated.Thisprocedureisrepeatedatﬁxedintervals.

Secondstage:Repeatthesamestepsasintheﬁrststage. How-ever,thisprocessiscarriedoutonsubscriberleveldatatoacquire subscriberKPI,andmaintainabinarymatrix.

Third stage: This is activated when a user initiatesa resource allocation request. A binary pattern is generated based on the userrequirements.Thispatternislaterhanded overtostage2to update the binary matrix (if required) and incorporate the new valuesintherowthat representsthe requested bandwidth.After generating the updated row, it is transferred to the ﬁrst stage for comparisons with the current Physical Resource Block (PRB) groups. To identify which PRBs suit the user, the fuzzy binary pattern-matchingalgorithm [78] wasusedforthatpurpose.

Using thisalgorithm,the executiontime increasedlinearlyfor anexponentialincreaseinthenumberofcomparisonpatterns.

3.4.4. FrameworkdevelopmentforbigdataempoweredSONfor5G The authors of [79] proposed a framework called Big data empowered SON (BSON) for 5G cellular networks. Developing an end-to-end network visibility is the core idea of BSON. This is realized by employing appropriate machine learning tools to obtainintelligencefrombigdata.

Accordingtotheauthors,whatmakesBSONdistinctfromSON arethreemainfeatures:

• Havingcompleteintelligence onthestatusofthe current net-work.

• Havingtheabilitytopredictuserbehavior.

• Havingthe abilitytolink betweennetworkresponse and net-workparameters.

The proposed framework contains operational and functional blocks,anditinvolvesthefollowingsteps:

1- Data gathering: An aggregate data set is formed from all the information sources in the network (e.g., subscriber, cell, and corenetworklevels).

2- Datatransformation:Thisinvolvestransformingthebigdatato therightdata.Thisprocesshasseveralsteps,startingfrom:

a. Classifying thedata accordingto keyOperational and Busi-nessObjectives(OBO),suchasaccessibility,retainability, in-tegrity,mobility,andbusinessintelligence.

b.Unify/diffuse stage,andthe resultof thisstage ismore sig-niﬁcant KPIs,which areobtained byunifying multiple Per-formanceIndicators(PIs).

c.According to the KPI impact on each OBO, the KPIs are ranked.

d. Filtrationis performedon theKPIsimpacting the OBOless thanapre-deﬁnedthreshold.

e. Relate, foreach KPI andﬁnd the Network Parameter (NP) thataffectsit.

f.Orderthe associatedNPforeachKPI accordingto their as-sociationstrength.

g. Cross-correlate each NP by ﬁndinga vector that quantiﬁes itsassociationwitheachKPI.

3- Model: Learn fromthe rightdata acquired instep 2 that will contributetothedevelopmentofanetworkbehaviormodel. 4- Run SON engine: New NPs are determined and new KPIs are

identiﬁedusingtheSONengineonthemodel.

5- Validate:IfanewNPcanbeevaluatedbyexpertknowledgeor previous operator-experience, proceed withthe changes. Oth-erwise,thenetwork simulatedbehavior fornewNPs is deter-mined.IfthesimulatedbehaviortallieswiththeKPIs, proceed withthenewNPs.

6- Relearn/improve: If the validation in step 5 was unsuccessful, feedbacktotheconceptdrift [80] block,whichwillupdatethe behaviormodel.Tomaintainmodelaccuracy,conceptdriftcan betriggered periodicallyevenifthere wasapositive outcome inthevalidationstep.

3.4.5. User-centric5Gaccessnetworkdesign

Enhancingthe user experience, givinga higher datarate, and reducingthelatencyareconsideredthekeygoals ofa5G system. Theauthorsof [27] expectthefollowingtobethekeyelementsin thedesignofauser-centric5Gaccessnetwork.

3.4.5.1. Personalized local content provisioning. It is important for theaccessnetworktoevolvefrombeinguserandserviceagnostic, byactingmerelyasablindpipethatconnectstheusertothecore network, to becoming user-centric. The latter term would direct the network access to the right path of being user and service aware, thus facilitating a key technology in 5G, which is local content provisioning, this in-turn would pay off in the form of end-to-endlatencyreductionandenhancingtheuserexperience.

The role of big data analytics becomes very clear, as it pro-vides a necessary ability of predicting user requirements. These requirementscanbemetinthecaseoflocalavailability.

The authorsof [27] proposed severalsteps to achieve content provisioning,andtheyareasfollows:

• Trafficanduserinformationacquisition:Trafficattributes (appli-cationtype,serveraddress,andportnumber,etc.)arecollected throughpacket analysis, andanalyzedusing aclustering algo-rithm (e.g., k-means [81] ) to perform traffic labelling (news, sports,andromance,etc.).

(10)

ﬁlteringwasrecommended [82] ), whichutilizesthe traﬃc at-tributesmentionedabove,andthusrecommendscontentbased onuser’sinterests.

• Localcachingand contentmanagement: Popularcontent is pro-videdintheformoflocalcopies.Contentthat mightbeof in-terest tothe usersare to be locallycached afterbeing down-loadedfromtheapplicationserver.

• Content provisioning: When an applicationrequest is initiated bytheuser,thesystemwillcheckifthecontentisalready lo-cally cached, so it can be sent directly. Furthermore,big data analysis can give content recommendations, thus the system willcheck ifthesecontentsare cachedlocally,andpushthem directlytotheuser.

3.4.5.2. Providingﬂexiblenetworkandfunctionalitydeployment. The authors of [27] noted that processing the characteristics of both the regional user andservice using big data analytics can be of greathelpinﬂexiblenetworkandfunctionalitydeploymentinthe followingways:

• Flexiblenetworkdeployment

Since 5G will support diverse low-cost Access Points (APs), such ascoverage APs and hot spot APs, usingbig data analytics can beusefulto forecastthe traﬃccharacteristic,henceestablish thebaseforachievingadynamicnetworkdeploymentfortheAPs.

• Flexiblefunctionalitydeployment

Analyzing and predicting regional user and service require-ments canberealizedusingbigdataanalytics.Thiswill formthe foundationof dynamicfunctionalitydeployment, whichwill help decide wheretodeploy certain functionalitymodules (e.g.,safety moduleswheretherearesecurityrequirements).

3.4.5.3. Usinguserbehaviorawarenesstooptimizewirelessresources 5Gnetworks. Optimizationofwirelessresourcesshouldbe carried-out according to the user and 5G service requirements. This is donetoenhancetheuserexperienceandimprovetheeﬃciency.

According to what the authorsof [27] proposed, bigdata can be used to analyze user mobility patterns, predict the motion trajectory, andhencepre-conﬁgurethe networkaccordingly.Each user’shistoricalAccessPoint(AP)listisrecordedbytheAPs.This data can be uploaded to a central module for processing, or to a target APin casethe service APwas altered. Using a bigdata analyticsalgorithm, thecollecteddata isanalyzed toforecast the motiontrajectories.

3.4.5.4. Big data-based network operation system. The authors of [27] proposed a system that can maximize the eﬃciency of big databasednetworkoperation.Thiscanbefulﬁlledbyoptimally al-locatingthenetworkresourcestoeachAPanduser.Theproposed systemconsistsoftwoparts:

1- Decisionmakingdomain

This is responsible for collecting and managing the user, network upgrades, conﬁguration, service, and terminal state in-formation. This domain exploits big data analytics to provide the basic conﬁguration essential for network initialization. For this domain to function properly, it would need to realize the wholepictureoftheuserandservicerequirements,aswellasthe functionalitydistributioninthenetwork.

2- Implementationdomain

Its responsibility is mainly for status reporting of the user, terminal, and network, dynamic deployment, and network con-figuration. Depending on the requirements after acquiring the personalizedconfiguration,thisdomain canuse thedynamicAPs functionalityandconfigurationtobuildmulti-connectivitybearers withterminals.

3.4.5.5. Multi-RAT orHetNetenergysaving. Smallcellsare usedin multi-RadioAccess Technology (RAT) orHetNetmobile networks. For energy saving purposes, when traffic falls below a certain threshold, small cells may enter into a dormant state, and this forces the small neighboring cells to serve the dormant cell’s coverage area. Several energy saving schemes were discussed in [83] . However, these schemes failed to adapt to the dynamic temporal and spatial traffic variations, as they operate under a relativelylongtimescaleleadingtounacceptabledelay.This hap-pens because when a number of newly arrived User Equipment (UE)seekaccesstothenetwork,wherecellactivationisessential, thecorrespondingsmallcellsrequirestobe firstactivated,before operating normally. Only at that point can the UEs access the networkusingthestandardprocedure.UEs,however,wouldsuffer fromanunacceptabledelaywhentryingtoaccessthecells.

The authors of [27] proposed a user-centric approach that is based on big data analytics. The proposed solution claims to achieveoptimalimplementationfortheactivationofsmallcellsas wellasUEaccesstothenetwork.Thus,jointoptimizationforboth UEs’accessandenergysavingcanbeachieved.

3.4.6. Networkﬂexibilityusingconsumptionprediction

Consumptionanalysisisconcerned withtwo factors:customer locationsandtypeofservice.Consumptiontrendscanbeclassiﬁed inatimelymannerintolong-term,seasonal,andshort-term.

To reach an accurate prediction, the authors in [68] implied that user data (e.g., GPS location and service usage) can be cor-related with other data (e.g., news, social network, events, and weather conditions). Using big data analytics to analyze these correlations, operators wouldbe able to decide when andwhere toplacetheirnodeswithoutaffectingthesubscribers’satisfaction.

4. TheroleofbigdataanalyticsinSDN&intra-datacenter networks

SDNoffersthe abilityto programthe networkwitha central-ized controller, thiscontroller is capable of programmingseveral dataplanesusingonestandardized openinterface,thus providing flexiblearchitecturalsupport [1] .Thefollowingresearchtopics uti-lizedthepropertiesofbothSoftwareDefined Network(SDN)and big data analytics by employing the analysis results to program thenetwork. Those topicscan be classifiedaccordingto thearea underdiscussionasfollows:

1-Traffic prediction: The paper surveyed in this section employ trafficpredictiontooptimizenetworkresourceallocation. 2-Traffic reduction: Pushing the aggregation from the edge

to-wardsthenetwork.

4.1. Traﬃcprediction

4.1.1. CognitiveroutingresourcesinSDNnetworks

The network’s next generation has to be smart and ﬂexible, with the ability to modify its strategy according to the net-work status in an automatic manner. To simplify the network management, SDN has made the above requirement possible by decoupling the control and forwarding planes through the OpenFlowprotocol [84] .

OpenFlowisconsideredtobethefirststandardizedprotocolin SDN,itisalsoidentifiedastheenablerofSDN.SDN/OpenFlowhas influenced Google to switch to OpenFlow inits inter-data center network, which resulted in an approximate 99% increase in the averageGoogleWANlinkutilization [85] .

Thearchitecture proposed by theauthorsof [85] included the followingparts:

(11)

The authors adopted the Hadoop platform to realizethe pre-diction functionality. They utilized the analyses of both network traffic and user application information to find each application flowdistribution.Foreach dataflow, they founda specific distri-butionlaw.Theyanalyzed thatlaw, andfordifferentapplications andareas theydeveloped apreliminarygeneralprediction model tofitthecaseofthesameapplicationbutindifferentareas.

2- InterfacedesignbetweenSDNcontrolleranddatabase

A cloud platform is responsible forcalculating and predicting theflowdistributionvaluesofeachOpenFlowswitch.Inaddition, this platform will read the link information and perform traffic prediction.Adatabasewill holdthe recordedvaluesandthe last predictedvalueswillbe updated.Toensure thattheallocation of resources accommodates the traffic variation, Floodlight (a Java-basedSDNcontrollerthatcanaccommodatedifferentapplications by loading different modules) will read the newest predicted valuesfromthedatabaseregularly.

3- SDNcontroller-basedroutingmodule

The predicted values are used as preferences to select the bestroute. Theresearchers usedan improvement onthe Dijkstra algorithm.

Application awareness and the prediction of user preferences were integrated into SDN through the newly proposed archi-tecture, which would facilitate and enhance network resources allocationandprovidebetterapplicationclassiﬁcation.

Theroleofbigdataisexemplifiedbyitsabilitytousenetwork flowanalysesandusers’behaviortoforecast thetypeandrateof theincomingtrafficflows.

Theprocedureproposedbytheauthorsof [85] isasfollows:

1- Currentnetworkload(i.e.,traﬃcvolumeandtype,etc.)isread byacloudplatform.

2- The overall traﬃc is predicted in advance and gathered in a database. Thisisdoneusinga bigdata-poweredprediction al-gorithm.

3- TheSDNcontrolleraccessesthedatabaseandreadsthestored datamentionedinsteptwo.

4- Aresource allocationscheme iscreatedby theSDNcontroller usingtheabove-mentionedroutingalgorithm,andthisscheme issentlatertotherelatedswitches.

Big data analyticscan use the users’requirements to provide the network with a dynamic resource allocation and applica-tion classiﬁcation, hence providing the network withbetter load balancingtechniques.

The resultsfrom theimplemented testbed showed the ability ofthe proposed solution to self-adapt towards ﬂow variation by dynamically issuing traﬃc tables to the related switches, which can increase the resource utilization and attain an improved overallloadbalance.

4.1.2. Predictingdatacommunicationvolumeatruntimeindata centernetworksusingSDN

Networksthat dealwithbigdataapplicationsmaysufferfrom the size and speed of data. For example, the networks’ overall response time is affected by MapReduce’s heavy-communication phase. This problem can be intensiﬁed if the communication patternsexperienceaheavyskewimpact.Theauthorsof [86] have proposed Pythia, which is a system that can optimize the data centernetworkattheruntimeby utilizingtherealtime commu-nicationpredictionofHadoop. Italso mapstheend-to-end ﬂows totheunderlyingnetwork.

Pythia utilizestheSDN-offered programmabilitytoachieve ef-ﬁcientandtimelynetworkresourceallocationforshuﬄetransfers.

Depending on the network workload and blocking ratio, the HadoopworkloadsawaconsistentaccelerationwhenusingPythia, and job completion time was reduced between 3% and 46% in comparisonwithMapReducebenchmarks.

4.2. Traﬃcreduction

4.2.1. Increasingnetworkperformancethroughtraﬃcreduction Large networks, such as those of Google and Facebook, or even small andmedium sized enterprise networks sufferfrom a plethora of traﬃc. This happens due to the colossal amount of databeingprocessedeitherinbatchorrealtimeapplications.

A common solution is to increase the available bandwidth in theenterpriseclusters.However,theauthorsof [87] proposed an-other approach that improves thenetwork performance, pushing the data aggregation from the edge towards the network, thus decreasingthetraﬃc.

A platform called CamCube [88] was used; it substitutes the use of dedicated switches by distributing the functionality of the switch across the servers. It is worth noting that CamCube offers theability tointerceptandmodify packets ateach hop.In addition, it uses a direct-connect topology, in which, a 1 Gbps Ethernetcross-overcableisusedtoconnectserverstoeachother inadirectway.

Exploiting CamCube’s properties to realize high performance, Camdoop, whichis a CamCube servicethat runs MapReduce-like jobs, was used. It offers full on-path data stream aggregation. Camdoop buildsaggregation treeswhere thechildren are resem-bledbytheintermediatedatasources,whiletherootsarelocated attheserversperformingtheﬁnalreductionintraﬃc.

AsmallprototypeofCamdooprunningonCamCubewastested, andasimulationwasusedtoshow thatthe samepropertiesstill hold at scale. The results showed a signiﬁcant traﬃc reduction with the proposed system when compared to Camdoop running on a switch and when compared to systems like Hadoop and Dryad/DryadLINQ [56,89] .

5. Theroleofbigdataanalyticsinopticalnetworks

This section discusses research papers that employ big data analytics for optical network design, the topics are classiﬁed as follows:

1- Network optimization: Here the parallel processing merits of Hadoopare utilized to reduce the executiontime ofdifferent (bin-packing)optimizationalgorithms.

2- Traffic prediction: Using big data analytics to dynamically re-configurethenetworkaccordingtopredictedtraffic.

5.1. Providingsolutionstonetworkoptimizationproblems

5.1.1. SolvingtheRWAproblem

The Routing and Wavelength Assignment (RWA) algorithm [90] plays an important role in optical networks. The authors in [91] consideredtheRWAalgorithmtobeanillustrationofthe bin-packingproblemthatislistedasaclassicalNP-hardproblem [92] . The authors in [91] used a Hadoop cloud computing system that consisted of 10 low-end desktop computers to indepen-dently run an instance of the RWA algorithm on each of them for a certain number of demand sequences. The goal is to suﬃ-ciently evaluatethe demandsequence within ashort period.The procedureisasfollows:

1- AninputﬁleisfedtotheHDFS,itincorporatestheinformation ofthelightpathdemandrequests.

(12)

from 0to 19 that later serve asrandom seeds in the reduce functions. It is worth noting that the authors set two reduce functionspercomputer(i.e.,atotalof20reducefunctions). 3- Thekey-valuepairsarethenforwardedtothe20reduce

func-tionswhereparallelcomputingisconducted.Thelightpath de-mandlistisshuffled250timesinarandomfashion(i.e.,5000 shuffled demand sequences), and this happens for each key-value pair within each reduce function. To acquire the num-berofrequiredwavelengths,theRWAheuristicisrunforeach ofthe shuffleddemand sequences.The optimalresultof each reduce functionisthen comparedagainst theremaining 19to findtheglobaloptimum.

Different test networks(ranging from20 to 500 nodes) were used to evaluate the performance of the Hadoop system. The resultswere optimality evaluated bycomparing themagainst the resultsofanIntegerLinearProgramming(ILP)optimizationmodel and they showed a close proximity to the optima (except for two cases). Itis worth notingthat theILP approach assumesfull wavelengthconversion,whichplays theroleofthe performance’s lowerboundinthepresentevaluation.

5.1.2. SolvingmultipleoptimizationproblemsusingHadoop

The authors in [93] proposed to solve several optimization problemsintheopticalnetworkparadigm.Theproblemsare:

1- Energy minimization problem [94] ,where thegoal is to min-imize the overall network power consumption from non-renewableenergysources.

2- SharedBackupPathProtection(SBPP)–basedelasticoptical net-work planningproblem [95] , wherea heuristic used the con-ceptofSpectrumWindowsPlanes(SWPs).Theobjectivewasto minimizethemaximumnumberofFrequencySlots(FSs)inthe network.

3- Adaptive Forward Error Correction (FEC) assignment problem [96] ,wherethegoalistomaximizethetotalnumberofFSs uti-lizedforuserdatatransmission.AheuristicbasedonSWPswas developedtosolve theRoutingandSpectrumAllocation(RSA) problem.

Theaboveproblemsareofabin-packingtypeandclassifiedas NP-hard. Several aspects (i.e. demand size and route) should be consideredwhenservingnetworktrafficdemands.Duetothehigh computational complexityand the order ofserved demands, the performanceofheuristicalgorithmstryingtosolvetheseproblems cannot be guaranteed.Thisisbecause ofusingthesimplelargest to smallest ordering strategy. Good performance can be attained by randomly shuffling demand sequences, then implement a heuristic algorithmfor eachsequence andchoosingthe one with the optimum performance. Toshorten thecomputation time and to overcomethecomputationalcomplexity,a Hadoopcloud com-putingsystemconsistingofsevencomputerswasproposedbythe authors in [93] . This way,a heuristic algorithm can be executed formultipleshuffleddemandsequencesinaparallelmanner.

The Hadoop MapReduce platform makes it possible to eval-uate multiple shuffled demand sequences in parallel. A heuristic algorithm serves each of the shuffled demand sequences and a resultis produced eachtime. The resultsare then compared and theonewiththebestperformanceischosen.Thesameprocedure isrepeatedon eachReducefunction. Thefinal globaloptimumis foundbycomparingtheresultsacrossallreducefunctions.

Performance evaluation is done by employing two test net-works;the24-node,43-linkUSNETnetwork(adoptedforproblems 1 and 3) and the 11-node, 26-link COST239 network (used for problem 2). For the ﬁrst optimization problem, the total con-sumption of non-renewable energy decreased by 8% (when the number of shuﬄed demand sequences increased from 1000 to

10,000). As forthe second optimization problem, the number of requiredFSs wassigniﬁcantly reduced. In thethird problem, the totalnumberofFSsforuserdatatransmissionwasincreasedand 3% performance improvement was noted when compared with thecaseofrunningHadooponasinglemachine.Thecomputation timeforall threeproblemswassigniﬁcantlyshortercompared to asingle-Hadoopmachine.

5.2.Dynamicnetworkreconﬁguration

5.2.1. VNTadaptabilityusingtraﬃcprediction

The emergence of new services is placing new demands on networks in terms of large and dynamic bit rate requirements. This caused network operators to look for a Virtual Network Topology (VNT) architecture that can cope with the anticipated traﬃc ina dynamicfashion. One solutionis realizedby overpro-visionthenetwork,however,thedownsideistheincreaseinTotal Costof Ownership(TCO).Another solutionsaves power by using threshold-based capacity reconﬁguration [97] . The drawback is thatthereis nosavinginthenumberoftranspondersthatneeds tobeinstalledineachIProutercomparedtooverprovisioning.

An alternative solution is proposed in [59] where VNT re-configuration can be attained regularly using big data analytics. This is done by periodically analyzing Origin-Destination (OD) traffic sothat VNT reconfiguration can be performedaccordingly. Collectionoftraffic monitoringdatatakesplaceatedgeIProuters regularly.Asetoftrafficsamplesiscollectedbyeveryedgerouter for every other destination router. These sets are stored in a collected data repository. According to a predefined time period, thecollected datais then summarizedfor each ODpair by peri-odically retrieving the collected data repository and performing datastreamminingsketches.The resultofthisstageisamodeled datarepositorywhichincludes, amongothers,maximum,average, andminimumbitrateforevery OD pair.Using machinelearning techniques (i.e. Artificial Neural Networks), a prediction module generatesthepredictedODtrafficmatrixfortheupcomingperiod. ThedecisiononwhethertoperformVNTreconfigurationornotis determinedby a decision-makermodule by relyingon the above-mentionedmatrix.Ifareconfigurationisrequired,aVNToptimizer is provided with both currentand predicted OD traffic matrices. The solution is fedback to the network controller to implement therequiredchangesinthedataplane.

The performance of the proposed solution was compared against the static and the threshold-based methods. An overall savingbetween8% and42%inthe numberofinstalled transpon-derswasachieved.The proposed solutionhasthe abilitytoreact duringlowtraﬃchoursbydeactivatingtransponders,thus,energy savingisattained.Also,theadvantageofcostreductionisattained byreleasinglightpathsfromtheunderlyingopticallayer.

6. Theroleofbigdataanalyticsinnetworksecurity

6.1. Peer-to-PeerBotnetdetection

ManysecurityproblemsontheInternet arecausedbyBotnets, whichcan be definedasnetworks ofmalware-infectedmachines controlledbyanadversary [98] .Botnetattacksformarealsecurity concern, with the ability to utilize 90,000 IPs in an attack [99] , itisa challengeon aninternationalscale,especially whentaking intoconsiderationthefinancialdamagetheycaninflict.