Academic year: 2022


(1)

Managing Incompleteness, Complexity and Scale in Big Data

Nick Duffield

Electrical and Computer Engineering, Texas A&M University

http://nickduffield.net/work

(2)

Three Challenges for Big Data

•   Complexity

–  Problem: high-dimensional data with complex dependence between variables, difficult to model

–  Solution: use machine learning to identify the dominant relationships

•  Incompleteness

–  Problem: not all quantities can be directly measured

–  Solution: statistically infer what we want from what we have

•  Scale

–  Problem: huge datasets are costly to store and slow to compute on

–  Solution: smart data reduction that retains the ability to answer the most important queries

(3)

Big Data Complexity: Customer Experience

•  Which objective metrics are most closely associated with customer dissatisfaction?

–  If known, remediate and prevent future troubles

•  Solution:

–  (Machine) learn which metrics (and values) and service settings are most associated with the occurrence of customer calls. Set action thresholds.

–  Monitor the metrics and take action when thresholds are exceeded

•  Operational savings

–  Reduce call volume to customer care center, reduce churn

•  Reverse problem

–  Learn the calling patterns and keywords most predictive of network problems

[Figure: objective metrics of network performance (packet loss and delay; line quality; service parameters) set against noisy measures of customer experience (customer care calls; social media; keyword analysis), with the link between the two marked "?"]
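The "learn metrics and set action thresholds" step can be illustrated with a minimal sketch. It assumes a single metric and a simple rule, "expect a call when the metric exceeds a threshold", and picks the threshold with the best agreement on labeled history. The function name `best_threshold` and the data are hypothetical and synthetic; the slides do not say which learning method is used.

```python
def best_threshold(values, called):
    """Pick the threshold t for the rule 'predict a call when value > t'.

    values: one metric reading per customer (hypothetical data).
    called: True if that customer called customer care.
    Returns (threshold, accuracy); ties keep the lowest threshold.
    """
    best = None
    for t in sorted(set(values)):
        correct = sum((v > t) == c for v, c in zip(values, called))
        accuracy = correct / len(values)
        if best is None or accuracy > best[1]:
            best = (t, accuracy)
    return best


# Synthetic example: packet loss (%) per customer, and whether they called.
packet_loss_pct = [0.1, 0.2, 0.3, 0.5, 1.5, 2.0, 2.5, 3.0]
called = [False, False, False, True, False, True, True, True]

threshold, accuracy = best_threshold(packet_loss_pct, called)
print(f"act when packet loss > {threshold}% (agreement {accuracy:.1%})")
```

A real deployment would learn over many metrics jointly and validate on held-out data; this only shows the shape of the thresholding step.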

(4)

Incompleteness: Internet tomography

•  What ISPs want

–  Origin-Destination (OD) traffic rates between any two routers

•  What ISPs have

–  Measured traffic rates on each link

•  Linear relation

–  Link_Rates = A . OD_Rates

–  A = routing matrix

•  encodes which links each OD traffic flow traverses

•  Solve? Under-constrained problem

–  Different possible sets of OD_Rates yield the same set of measured Link_rates
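The under-constrained nature of Link_Rates = A . OD_Rates can be seen on a toy network. The sketch below assumes a hypothetical three-router chain X → Y → Z with two links and three OD pairs; the topology and the rates are made up for illustration.

```python
# Routing matrix for a hypothetical 3-router chain X -> Y -> Z.
# Rows are links (X->Y, Y->Z); columns are OD pairs (X->Y, X->Z, Y->Z).
# An entry is 1 when the OD pair's traffic traverses the link.
A = [
    [1, 1, 0],  # link X->Y carries OD pairs X->Y and X->Z
    [0, 1, 1],  # link Y->Z carries OD pairs X->Z and Y->Z
]


def link_rates(routing, od_rates):
    """Compute Link_Rates = A . OD_Rates for a routing matrix given as rows."""
    return [sum(a * x for a, x in zip(row, od_rates)) for row in routing]


od_1 = [5, 5, 5]    # one candidate set of OD rates
od_2 = [10, 0, 10]  # a different candidate

print(link_rates(A, od_1))
print(link_rates(A, od_2))
```

Both candidates produce the same two link measurements, so the link rates alone cannot tell them apart: there are three unknowns but only two equations.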

(5)

Internet Tomography

•  Gravity Model?

–  OD_Rate(A à B) =const. × Rate(AàALL) × Rate(ALLàB) –  Can measure Rate(AàALL) at links emanating from A

•  Problem with gravity!

–  Gravity model is not a solution of Link_Rates = A . OD_Rates

•  Solution: Tomogravity

–  Use solution closest to gravity model!

•  Penalized likelihood solution

–  Quick to compute, good accuracy

•  In daily use in ISPs, Routers

[Figure: in the (M1, M2) plane, the constraint subspace L = A.M, the gravity model solution, and the tomogravity solution, i.e. the least-squares solution on the constraint subspace]
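The idea of taking the solution closest to the gravity model can be sketched as a projection: given a gravity estimate g that does not satisfy Link_Rates = A . OD_Rates, move it by the smallest Euclidean step onto the constraint subspace. The sketch below assumes a plain unweighted least-squares distance, a routing matrix with linearly independent rows, and a gravity estimate supplied as input; the weighting and any non-negativity handling used in practice are not specified in the slides. All names and numbers are hypothetical.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [[float(x) for x in row] + [float(bi)] for row, bi in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def closest_consistent(A, measured_links, gravity):
    """Return x with A x = measured_links that is closest to `gravity`.

    Uses x = g + A^T (A A^T)^{-1} (y - A g), the Euclidean projection of g
    onto the constraint subspace. Assumes the rows of A are independent.
    """
    residual = [y - p for y, p in zip(measured_links, matvec(A, gravity))]
    AAt = [[sum(a * b for a, b in zip(ri, rj)) for rj in A] for ri in A]
    z = solve(AAt, residual)
    At = [list(col) for col in zip(*A)]
    correction = matvec(At, z)
    return [g + c for g, c in zip(gravity, correction)]


# Hypothetical chain X -> Y -> Z: links (X->Y, Y->Z), OD pairs (X->Y, X->Z, Y->Z).
A = [[1, 1, 0], [0, 1, 1]]
measured_links = [10, 10]
gravity = [6, 3, 6]  # assumed gravity estimate; it implies link rates [9, 9]

od_estimate = closest_consistent(A, measured_links, gravity)
print(od_estimate)
print(matvec(A, od_estimate))  # reproduces the measured link rates
```

The result keeps the shape of the gravity estimate while agreeing with the link measurements, which is the role the constraint subspace plays in the figure.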

(6)

Big Data Scale

•  ISP operations generate 100s of Terabytes of usage measurement data daily

•  Passive traffic measurements by (core) routers

–  Session-level traffic summaries (flow records)

–  Each flow record reports IP source and destination, #packets, bytes, timing, ...

–  Core routers stream flow records to collectors for analysis

•  Used widely in network management

–  timescale from months (planning) to seconds (security)

•  Still need tomogravity outside the core!

(7)

Managing Data Scale through Sampling

•   Turn Big Data into Smaller Data

–  Savings in storage, bandwidth; speed up queries

•  Reference sampling

–  Reuse samples over multiple retrospective queries

–  Know query class in advance, but not specific query

•  “Smart” sampling

–  Matches data characteristics to analysis requirements

–  E.g. uniform sampling is useless on heavy tails

•  Streaming constraints

–  Sample to be computable in small time per item

–  Big data constraint often not met in classical methods

(8)

Statistically Optimal Stream Sampling

•   Aim:

–  Sample fraction of flow records

–  Use to answer queries approximately

•   Problem: heavy tails

–  10% of the flow records report 90% of bytes

–  Uniform sampling misses most of the 10%

•  Big hit on accuracy

•   Solution:

–  Statistically optimal non-uniform sampling algorithms (minimal estimation variance)

–  Computationally feasible for stream sampling

–  In use in ISPs
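To make the contrast with uniform sampling concrete, here is a minimal sketch of one size-dependent scheme: threshold sampling with Horvitz-Thompson estimates. A record of size x is kept with probability min(1, x/z) and, if kept, counts as max(x, z) toward the total, which keeps the estimate unbiased. The slides do not say which algorithm is meant, so treat this as an illustration of the idea; the threshold z, the Pareto-distributed sizes, and the function names are assumptions.

```python
import random


def threshold_sample(sizes, z, rng):
    """Keep each record with probability min(1, size / z).

    Returns (size, weight) pairs, where weight = max(size, z) is the
    Horvitz-Thompson estimate of the record's size. Records with
    size >= z are always kept and counted exactly.
    """
    sample = []
    for x in sizes:
        if rng.random() < min(1.0, x / z):
            sample.append((x, max(x, z)))
    return sample


def uniform_sample(sizes, p, rng):
    """Keep each record with the same probability p; weight = size / p."""
    return [(x, x / p) for x in sizes if rng.random() < p]


def estimate_total(sample):
    """Estimate the total size from the weights of the sampled records."""
    return sum(weight for _, weight in sample)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic heavy-tailed "byte counts" (assumption: Pareto with shape 1.2).
sizes = [rng.paretovariate(1.2) for _ in range(10_000)]
true_total = sum(sizes)

z = 20.0
# Uniform rate chosen so both schemes keep the same expected number of records.
p = sum(min(1.0, x / z) for x in sizes) / len(sizes)

print(f"true total:         {true_total:12.1f}")
print(f"threshold estimate: {estimate_total(threshold_sample(sizes, z, rng)):12.1f}")
print(f"uniform estimate:   {estimate_total(uniform_sample(sizes, p, rng)):12.1f}")
```

Because every record of size z or more is always kept and counted exactly, the few very large flows cannot be missed; under uniform sampling a single missed or included large flow can swing the estimate substantially. Each record is processed once with constant work, which fits the streaming constraint.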

(9)

Taming the Heavy Tail

•  Distribution of traffic estimates

[Figure: two panels comparing uniform sampling and smart sampling]

(10)

Next: Streaming ISP Graph Data

•   ISP Communications Graph from Flow Records

–  node = IP address;

–  edge = flow from source to destination

[Figure: communications graph with labels "compromise", "control", "flooding"]

•  Hard to detect against background

•  Known attacks:

–  Signature matching based on subgraphs, flow features, timing

•  Unknown attacks:

–  exploratory & retrospective analysis

•  Smart sampling of subgraphs

(11)

Sampling + Knowledge Discovery

•  Interplay between sampling and data mining is not well understood

–  Need to understand how ML/DM algorithms are affected by sampling

–  E.g. how big a sample is needed to build an accurate classifier?

–  E.g. what sampling strategy optimizes cluster quality?

•  Expect results to be method specific

–  E.g. “smart sampling + k-means”

(12)

Sampling and Privacy

•  Current focus on privacy-preserving data mining

–  Opportunity for sampling to be part of the solution

•  Naïve sampling provides “privacy in expectation”

–  Your data remains private if you aren’t included in the sample…

•  Intuition: uncertainty from sampling contributes to privacy

–  This intuition can be formalized with different privacy models

•  Sampling can be analyzed in the context of differential privacy

–  Sampling alone does not provide differential privacy

–  But applying a DP method to sampled data does guarantee privacy

–  A tradeoff between sampling rate and privacy parameters

•  Understand benefits as well as risks of information flows

–  Network calculus of risk/reward trade-off from information sharing, joining
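The tradeoff between sampling rate and privacy parameters noted above can be made concrete with the standard "privacy amplification by subsampling" bound, which the slides do not state explicitly. It assumes each record enters the sample independently with probability q and that an ε-differentially-private mechanism is then run on the sample; the combined procedure then satisfies ε'-DP with ε' = ln(1 + q(e^ε − 1)). The function name below is hypothetical.

```python
import math


def amplified_epsilon(epsilon, q):
    """Effective privacy parameter of an epsilon-DP mechanism run on a sample.

    Assumes each record is included independently with probability q
    (Poisson sampling). Returns ln(1 + q * (exp(epsilon) - 1)).
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("sampling rate q must be in (0, 1]")
    if epsilon < 0.0:
        raise ValueError("epsilon must be non-negative")
    return math.log1p(q * math.expm1(epsilon))


# Lower sampling rates give a smaller (stronger) effective epsilon.
for q in (1.0, 0.5, 0.1, 0.01):
    print(f"q = {q:<5} effective epsilon = {amplified_epsilon(1.0, q):.4f}")
```

With q = 1 the bound returns ε unchanged, and for small ε it is approximately q·ε, so sampling less buys privacy at the cost of estimation accuracy.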

(13)

Outlook

•   Big data challenges

–  Incompleteness, complexity, scale

•  Generic problems; transferable solutions

–  Find causal relations in high dimensional data

•  Use machine learning for discovery & prediction

–  Big Data Tomography

•  Solve ill-posed inverse problems with constraints from models and side data

–  Smart Sampling

•  Speed up computations and save on resources

•  Tune sampling to mediate between data and queries

–  Role of sampling in ML/DM, privacy,…
