Big Data: Some Technical Challenges

(1)

Thales Communications & Security

A Tentative Typology of Big Analytics Problems

J.F. Marcotorchino

(2)

The information contained in this document and any attachments are the property of THALES. You are hereby notified that any review, dissemination, distribution, copying or otherwise use of this document is strictly prohibited without Thales prior written approval. © THALES 2011. Template trtp version 7.0.8

BIG DATA/BIG ANALYTICS

(3)

Definitions

Big Data: all the technologies and techniques that help with scaling:
• Large (virtual) file storage
• Distributed processing (Hadoop / MapReduce)
• NoSQL databases, for simple and complex queries

Big Analytics: techniques that are executed on a Big Data infrastructure and have the following properties:
• They adapt ad hoc techniques (statistics, learning) to this environment
• They scale linearly or quasi-linearly (O(N) or O(N log N) order of magnitude), or lend themselves to heavy parallelization
• Linearization is mandatory, either at the “criteria level” or at the “constraint polytopes level”
• They use special types of learning techniques based on dimension reduction.

(4)

The 4 V Challenge

Volume:
• Large storage capacity is now available
• NAS (Network Attached Storage)
• Virtualized storage
• Cloud computing

Velocity:
• Strong demand for immediate results
• Stream analytics for SEP/CEP (Stream & Complex Event Processing)
• In-memory computations adapted to key-value stores

Variety:
• Large diversity of heterogeneous data types
• Structured data (classical DB entries) or semi-structured data (images with added metadata)
• Unstructured data: text, speech, raw images, etc.

Value:
• The intrinsic value of the « data/information » pair is now recognized by businesses


(5)

Some Confusions to Avoid

Do not confuse:
• combinatorial complexity vs indexing complexity
• the difficulty of IT computations vs the management of huge data volumes (HPC vs Big Data)

In the first case:
It is not the amount of data per se that is the obstacle, but the intrinsic combinatorial structure of the problem to solve.
Example: ≅ 10^29300 solutions (Berend-Tassa estimate, 2010) to explore for clustering a set of N = 10000 objects or individuals. Nevertheless, N = 10000 is not a huge amount.

In the second case:
It is the amount of data itself that poses a problem, through the structure of the indexing and storage architectures (a difficulty due to scalability constraints).
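The order of magnitude quoted in the example can be checked with a few lines of Python. The sketch below assumes that the estimate refers to the Berend-Tassa upper bound on Bell numbers, B_n < (0.792 n / ln(n + 1))^n, where B_n counts the partitions of an n-element set; the slide does not state the formula, and the function names are illustrative.

```python
import math


def bell_number(n: int) -> int:
    """Exact Bell number B_n (number of partitions of an n-set), via the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]  # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


def log10_bell_upper_bound(n: int) -> float:
    """log10 of the assumed bound B_n < (0.792 * n / ln(n + 1)) ** n."""
    return n * math.log10(0.792 * n / math.log(n + 1))


# Sanity check on a small case, where the exact value is cheap to compute.
print(bell_number(10), log10_bell_upper_bound(10))

# Order of magnitude for N = 10000 (only the bound: the exact value is too costly here).
print(log10_bell_upper_bound(10_000))
```

For N = 10000 the bound has a base-10 logarithm between 29300 and 29400, which matches the figure on the slide even though the data set itself is small.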

(6)

How to Address Scalability Problems

Scalability by « linearization » vs scalability by « parallelization »

In the first mode: if the computing time needed for a population of N objects is T, a linear algorithm will take a computing time ≅ αT when the population size jumps from N to αN.

In the second mode: if an algorithm dedicated to a population of size N can be processed on a SINGLE machine within a time T, then when the population scales up to αN (α integer), the computations can be distributed over α machines to keep a computing time equal to T.

Combining both modes is the best possible approach (when suitable).
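The two modes can be combined in a few lines. The following is a minimal sketch, not a benchmark: each worker makes one linear pass over its own chunk, and the partial results are merged at the end. Threads stand in for the α machines (in CPython they give no real speed-up for CPU-bound work), and the function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """One linear pass over a chunk: O(len(chunk)) work."""
    total = 0.0
    count = 0
    for x in chunk:
        total += x
        count += 1
    return total, count


def split(data, alpha):
    """Cut non-empty data into at most alpha contiguous chunks of near-equal size."""
    size = -(-len(data) // alpha)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def distributed_mean(data, alpha):
    """Mean computed as alpha independent linear passes plus a cheap merge."""
    with ThreadPoolExecutor(max_workers=alpha) as pool:
        partials = list(pool.map(partial_stats, split(data, alpha)))
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


data = list(range(1, 101))
print(distributed_mean(data, alpha=4))
```

The merge step only touches α partial results, so each worker's cost stays proportional to the size of its own chunk.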

(7)

An Operational Characterization of Big Analytics Methods

Big Data Analytics: « Extended » vs « Intrinsic » cases

« Extended » case:
• Possible use of NoSQL storage architectures, or new SQL ones
• Exhaustive analysis of the whole data set is not mandatory at all
• « Analytic sampling » or « big sampling » is sufficient in most cases, e.g. customer segmentation, CRM, cross-selling, churn & attrition analysis, intrusion analysis or HUMS (Health & Usage Monitoring Systems)
• The rest of the population, outside the « samples », is processed by

(8)

An Operational Characterization of Big Analytics Methods

Big Data Analytics: « Extended » vs « Intrinsic » cases

« Intrinsic » case:
• It is mandatory to rely on the full data set (exhaustivity), even though avoiding this remains a research topic
• No a priori knowledge, or only partial knowledge, of the population structure
• Data are stored in NoSQL architectures using the appropriate correspondence formats (for example, for graph DBs: Neo4j, or FlockDB, an open-source, distributed, fault-tolerant graph database for managing data at scale, chosen by Twitter)
• To manage the exhaustivity constraint, one has to use heuristics or meta-heuristics based on linear iterations, or parallelization through distributed computations

(9)

Some NoSQL DB Types

• Key-value stores
• Column-oriented DBs
• Document-oriented DBs
• Graph databases: Neo4j

Systems named on the slide: BigTable (Google), DynamoDB (Amazon), Infinity DB, and a Facebook system.

Complexity grows like E Rel, where E = number of entities and Rel = average number of relationships per entity.

(10)

BIG DATA CONCEPTUAL FOUNDATIONS

[Brewer's CAP theorem]

It is impossible to satisfy all 3 properties at once: choose 2.
• Consistency (C)
• Availability (A)
• Partition tolerance (P)

[Figure: CAP triangle with the pairwise combinations CA, AP and CP. Systems placed on it: MemcacheDB / Berkeley DB, Voldemort, CouchDB, HBase]

(11)


Some Ideas for Tackling Intrinsic Big Analytics Problems

• Use mainly exhaustive methods (if possible, no statistical sampling) (data-driven vs hypothesis-driven)
• Affinity analysis & sequential patterns (pure linear matchings, scalar products)
• Use classifiers with linear criteria
• Practice iterative queries: R²I², Intelligent Iterative Recursive Querying (« Requêtage Récursif Itératif Intelligent »), which alternates two techniques: regularized similarity + « on the fly » clustering
• Unsupervised clustering (no a priori): extending « No K-Means » approaches using linear relational criteria
• Text mining (word spotting)
• Reticular data analysis (social networks, huge IT networks)

(12)


(13)

Tentative Structuring of Big Analytics Approaches

[Figure: map of Big Analytics approaches. Axes: « Level of Problem Complexity » and « Lack of Population Knowledge ». Families: Reticular Data Structuring; Classical BI Data Mining; Learning & Neural Nets; Vector Matching Structuring. Methods placed on the map: learning model for unsupervised classification; limited-layers neural nets; naïve Bayes networks; self-encoded and hourglass-shaped neural nets; image & video analytics; sequential patterns recognition & affinity analysis; parallel coordinates; unsupervised clustering; large networks topological design; supervised rule-based classification; social networks communities detection; reticular visual analytics; bi-class SVM; faces & pattern recognition; piecewise linear regression; multi-class SVM; MOLAP and XOLAP; MDL learning]

(14)

An Example of an Intrinsic Big Analytics Problem: Graph Modularity

Girvan-Newman’s quadratic formulation

[Figure: Krebs’ graph on American politics, with “Liberal”, “Conservative” and “Centrist” groups. Credit: S. Mandal (MIT)]

The modularity of a network is “the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random” (“Deviation from Independence”).

MIT heuristic algorithm:
• Construct the modularity matrix and find its largest eigenvalue and the corresponding eigenvector
• Partition the network into two parts based on the signs of the elements of that eigenvector
• Repeat for each part
• If a proposed split does not cause modularity to increase, declare the subgraph indivisible and do not split it
• When the entire graph consists of indivisible subgraphs, stop

Typical running time: O(N² log N) for a sparse graph

• Maximizing modularity rigorously may be NP-hard
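The quoted definition of modularity can be made concrete on a toy graph. The sketch below assumes the usual normalized Newman-Girvan form, Q = Σ_c [e_c / m − (d_c / 2m)²], where m is the number of edges, e_c the number of edges inside group c and d_c the sum of the degrees in group c; the normalization by m, the graph and the names are illustrative choices, not taken from the slide.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Normalized modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)    # e_c: edges with both ends in group c
    degree_sum = defaultdict(int)  # d_c: sum of the degrees of the nodes in group c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Observed share of internal edges minus the share expected at random.
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by a single bridge edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

print(modularity(EDGES, two_groups))  # 5/14: the split along the bridge scores well
print(modularity(EDGES, one_group))   # 0: a single group gains nothing over chance
```

Each bisection step of the heuristic above is accepted only when a comparison of this kind shows an increase in modularity.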

(15)

By a relational transform, we turn the criterion into a linear function subject to linear constraints.

Idea: relying on the locally linear « Louvain » algorithm (Blondel, Guillaume) (Univ. Louvain / UPMC LIP6), use the linear relational form: O(N log N)

We can do more: thanks to the genericity of the Louvain algorithm, we can use better linear criteria than Girvan-Newman’s, based on optimal transport justifications, e.g. « Deviation to Indetermination » (Patricia Conde-Céspedes)

Linear constraints on the relational matrix X (X_{ij} = 1 when i and j belong to the same class, 0 otherwise):

X_{ij} − X_{ji} = 0  ∀ (i, j)  (Symmetry)
X_{ii} = 1  ∀ i  (Reflexivity)
X_{ij} + X_{jk} − X_{ik} ≤ 1  ∀ (i, j, k)  (Transitivity)
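As an illustration of the relational form, the sketch below builds the 0/1 matrix X of a partition, checks the three linear constraints, and evaluates modularity as a linear function of X. The formula Q(X) = (1/2m) Σ_ij (a_ij − k_i k_j / 2m) X_ij is the standard normalized Newman-Girvan expression, assumed here because the slide does not spell it out; the toy graph and all names are illustrative.

```python
from itertools import product


def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph on n nodes."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1
    return A


def relation_matrix(labels):
    """X[i][j] = 1 if i and j share a class, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]


def is_equivalence(X):
    """Check the three linear constraints: symmetry, reflexivity, transitivity."""
    n = len(X)
    symmetric = all(X[i][j] - X[j][i] == 0 for i, j in product(range(n), repeat=2))
    reflexive = all(X[i][i] == 1 for i in range(n))
    transitive = all(X[i][j] + X[j][k] - X[i][k] <= 1
                     for i, j, k in product(range(n), repeat=3))
    return symmetric and reflexive and transitive


def linear_modularity(A, X):
    """Q(X) = (1 / 2m) * sum_ij (A[i][j] - k_i * k_j / 2m) * X[i][j], linear in X."""
    n = len(A)
    k = [sum(row) for row in A]  # node degrees
    two_m = sum(k)               # twice the number of edges
    return sum((A[i][j] - k[i] * k[j] / two_m) * X[i][j]
               for i in range(n) for j in range(n)) / two_m


# Two triangles joined by a bridge, split along the bridge.
A = adjacency(6, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])
X = relation_matrix(["a", "a", "a", "b", "b", "b"])
print(is_equivalence(X), linear_modularity(A, X))
```

The brute-force transitivity check costs O(N³) and is only meant to make the constraints tangible; avoiding that cost is the point of locally linear heuristics such as Louvain.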

(16)

Big Analytics: Some Topics of Interest

Big Analytics for Cyber-Security:
Components for attack detection and investigation (intelligent IDS from normalized log analytics, IS passive and dynamic mapping, log analytics, cyber intelligence)
• Attack detection from relational & content data, intelligent IDS and sandbox coupling
• Intelligent coupling with IS passive and dynamic mapping
• Big Data platform for log analytics, visual analytics

Big Analytics for Smart Transport:
Business analytics web portal for passenger behaviour and profile understanding, traffic anomaly detection
• New components and use cases focused on mobility
• Approach based on space-time queries, BI, early warning engine, Big Analytics and optimization techniques for Smart City
• Fraud detection

Big Analytics for National Security:
Social web intelligence for national security
• Cyber-infringement detection and investigation
• SNA: social mining, crisis management
• Maritime security: predictive analysis & anomaly detection
• E-border: Big Analytics on passenger logs

Big Analytics for Maintenance:
HUMS (Health & Usage Monitoring Systems): applications to vehicle, radar, weapon systems, transport…

(17)

Big Analytics Innovation Trends at Medium-Range Horizon

• Coupling auto-encoder neural nets with predictive modeling for feature extraction
• Opening « data streaming processing » (real time) to more sophisticated and powerful analytical tools: towards real-life CEP
• Coupling « genetic algorithms » with « relational linear transforms » (linearization procedures)
• In network analysis, addressing the complexity of dynamic graph modeling.