Academic year: 2021


Thales Communications & Security

Big Data: Some Technical Challenges

Toward a Typology of Big Analytics Problems

J.F. Marcotorchino


The information contained in this document and any attachments are the property of THALES. You are hereby notified that any review, dissemination, distribution, copying or otherwise use of this document is strictly prohibited without Thales prior written approval. © THALES 2011. Template trtp version 7.0.8

BIG DATA/BIG ANALYTICS


Definitions

Big Data: all the technologies and techniques that help with scaling:
  - Large file storage (virtualized)
  - Distributed processing (Hadoop / MapReduce)
  - NoSQL databases (simple and complex queries)

Big Analytics: techniques that run on a Big Data infrastructure and have the following properties:
  - They adapt ad hoc techniques (statistics, learning) to this environment.
  - They scale linearly (O(N) or O(N log N) order of magnitude) or lend themselves to heavy parallelization.
  - Linearization is mandatory, either at the "criteria level" or at the "constraint polytope level".
  - They use specific learning techniques based on dimension reduction.


The 4 V Challenge

Volume: large storage capacity is now available
  - NAS (Network Attached Storage)
  - Virtualized storage
  - Cloud computing

Velocity: strong demand for immediate results
  - Stream analytics for SEP/CEP (stream and complex event processing)
  - In-memory computation adapted to key-value stores

Variety: wide diversity of heterogeneous data types
  - Structured data (classical database entries) or semi-structured data (images with added metadata)
  - Unstructured data: text, speech, raw images, etc.

Value: the intrinsic value of the "data/information" pair is now recognized by businesses


Some Confusions to Avoid

Do not confuse combinatorial complexity with indexing complexity, that is, the difficulty of the computation itself with the management of huge data volumes (HPC vs Big Data).

  - In the first case, it is not the amount of data per se that is the obstacle, but the intrinsic combinatorial structure of the problem to solve.
    Example: there are ≈ 10^29300 solutions (Berend-Tassa estimate, 2010) to explore when clustering a set of N = 10000 objects or individuals. Yet N = 10000 is not a huge amount.
  - In the second case, it is the amount of data itself that poses the problem, through the structure of the indexing and storage architectures (difficulty due to scalability constraints).
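The order of magnitude in the first case can be checked with a few lines of Python. The number of ways to partition N objects is the Bell number B_N. The sketch below computes small Bell numbers exactly with the Bell triangle and evaluates, in log10, the upper bound B_N < (0.792 N / ln(N + 1))^N, which is assumed here to be the form of the Berend-Tassa (2010) bound. The function names are illustrative only.

```python
import math


def bell(n: int) -> int:
    """Exact Bell number B_n (number of partitions of n objects), via the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]  # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


def log10_partition_bound(n: int) -> float:
    """log10 of the assumed upper bound (0.792 * n / ln(n + 1)) ** n on B_n."""
    return n * math.log10(0.792 * n / math.log(n + 1))


if __name__ == "__main__":
    print(bell(10))                       # exact count for a tiny population
    print(log10_partition_bound(10_000))  # exponent of 10 for N = 10000
```

For N = 10000 the exponent falls between 29300 and 29400, which matches the order of magnitude quoted on the slide, even though the data set itself fits easily in memory.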


How to Address Scalability Problems

Scalability by "linearization" vs scalability by "parallelization"

  - In the first mode: if a population of N objects requires a computing time T, a linear algorithm will take a computing time ≈ αT when the population size grows from N to αN.
  - In the second mode: if an algorithm designed for a population of size N can be processed on a SINGLE machine within a time T, then when the population scales up to αN (α an integer), the computations can be distributed over α machines to keep the computing time equal to T.

Combining both modes is the best possible approach (when applicable).
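The two modes can be illustrated together on a toy computation. The following is a minimal sketch, not a benchmark: the work on each chunk is a single linear pass (first mode), and a population of size αN is cut into α chunks handled by α workers (second mode). Threads stand in for the α machines here. In CPython they do not give a real speed-up on CPU-bound work, so the example only shows the decomposition. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_of_squares(chunk):
    """Linear-time work on one chunk: a single pass, O(len(chunk))."""
    return sum(x * x for x in chunk)


def split(data, alpha):
    """Cut data into alpha contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), alpha)
    chunks, start = [], 0
    for i in range(alpha):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def distributed_sum_of_squares(data, alpha):
    """Map each chunk to a worker, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=alpha) as pool:
        partials = list(pool.map(partial_sum_of_squares, split(data, alpha)))
    return sum(partials)


if __name__ == "__main__":
    population = list(range(1_000))
    print(distributed_sum_of_squares(population, alpha=4))
```

The decomposition works because the criterion is additive over chunks. A criterion that couples all pairs of objects would not split this way without first being linearized.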


An Operational Characterization of Big Analytics Methods

Big Data Analytics: "Extended" vs "Intrinsic" cases

"Extended" case:
  - NoSQL storage architectures, or NewSQL ones, can be used.
  - Exhaustive analysis of the whole data set is not mandatory at all.
  - "Analytic sampling" or "Big Sampling" is sufficient in most cases, e.g. customer segmentation, CRM, cross-selling, churn and attrition analysis, intrusion analysis or HUMS (Health & Usage Monitoring Systems).
  - The remaining set of the population, outside the "samples", is processed by
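The slides do not say which sampling scheme "Big Sampling" refers to. As an illustration only, one common way to draw a uniform sample from a data set too large to hold in memory is reservoir sampling (Algorithm R), which reads the data once and keeps k items. The function name and parameters below are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Uniform sample of k items from an iterable of unknown length, in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = rng.randint(0, index)    # inclusive bounds: 0 <= j <= index
            if j < k:
                reservoir[j] = item      # keep the new item with probability k / (index + 1)
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), k=10, seed=42)
    print(len(sample))
```

Memory use is O(k) whatever the size of the population, which is what makes this kind of sampling compatible with the "Extended" case.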


An Operational Characterization of Big Analytics Methods

Big Data Analytics: "Extended" vs "Intrinsic" cases

"Intrinsic" case:
  - Relying on the full data set (exhaustivity) is mandatory, even though avoiding it remains a research topic.
  - There is no a priori knowledge, or only partial knowledge, of the population structure.
  - Data are stored in NoSQL architectures using the appropriate formats (for graph databases, for example: Neo4j, or FlockDB, an open-source, distributed, fault-tolerant graph database for managing data at scale, chosen by Twitter).
  - To handle the exhaustivity constraint, one must use heuristics or metaheuristics based on linear iterations, or parallelization through distributed computations.

Some NoSQL DB Types

[Figure: families of NoSQL databases with example products]

Families: key-value stores, column-oriented DBs, document-oriented DBs, graph databases.

Vendors and products named in the figure: Google (BigTable), Facebook, Infinity DB, Amazon (DynamoDB), Neo4j.

Complexity grows with E and Rel, where E = number of entities and Rel = average number of relationships per entity.

BIG DATA CONCEPTUAL FOUNDATIONS

[Brewer's CAP theorem]

It is impossible to satisfy all 3 properties at once: choose 2.
  - C: Consistency
  - A: Availability
  - P: Partition tolerance

[Figure: CAP triangle showing the pairwise combinations (C-A, A-P, C-P) and example systems: MemcacheDB / Berkeley DB, Voldemort, CouchDB, HBase]

This document may not be reproduced, modified, adapted, published or translated in any way, in whole or in part, nor disclosed to a third party, without the prior written consent of Thales. © THALES 2012. All rights reserved. Template trtp version 7.1.0

Some Ideas for Tackling Intrinsic Big Analytics Problems

Use mainly exhaustive methods (if possible, no statistical sampling) (data-driven vs hypothesis-driven):
  - Affinity analysis and sequential patterns (purely linear matching by scalar products)
  - Classifiers with linear criteria
  - Iterative queries
    R²I² (Requêtage Récursif Itératif Intelligent, i.e. intelligent iterative recursive querying): two techniques applied alternately, regularized similarity and "on the fly" clustering
  - Unsupervised clustering (no a priori), extending "No K-Means" approaches using linear relational criteria
  - Text mining (word spotting)
  - Reticular data analysis (social networks, huge IT networks)


Tentative Structuring of Big Analytics Approaches

[Figure: map of Big Analytics approaches. Axes: "Lack of Population Knowledge" and "Level of Problem Complexity". Regions: Reticular Data Structuring; Classical BI Data Mining; Learning & Neural Nets; Vector Matching Structuring.]

Methods placed on the map: learning model for unsupervised classification; limited-layer neural nets; naïve Bayes networks; self-encoded and hourglass-shaped neural nets; image and video analytics; sequential pattern recognition and affinity analysis; parallel coordinates; unsupervised clustering; topological design of large networks; supervised rule-based classification; community detection in social networks; reticular visual analytics; two-class SVM; face and pattern recognition; piecewise linear regression; multi-class SVM; MOLAP and XOLAP; MDL learning.


An Example of an Intrinsic Big Analytics Problem: Graph Modularity

Girvan-Newman's quadratic formulation

The modularity of a network is "the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random" ("Deviation to Independence").

[Figure: Krebs' graph on American politics, with "Liberal", "Conservative" and "Centrist" groups. Source: S. Mandal (MIT)]

MIT heuristic algorithm:
  - Construct the modularity matrix and find its largest eigenvalue and the corresponding eigenvector.
  - Partition the network into two parts based on the signs of the elements of that eigenvector.
  - Repeat for each part.
  - If a proposed split does not increase modularity, declare the subgraph indivisible and do not split it.
  - Stop when the entire graph consists of indivisible subgraphs.

Typical running time: O(N² log N) for a sparse graph.

Maximizing modularity rigorously may be NP-hard.
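The quoted definition corresponds to the usual quadratic form Q = (1/2m) Σ_ij (a_ij − k_i k_j / 2m) δ(c_i, c_j), where a_ij is the adjacency matrix, k_i the degree of node i, m the number of edges and c_i the group of node i. The following is a minimal sketch that evaluates Q for a given partition. It is not the MIT heuristic itself, it loops over all node pairs (so it is only suitable for small graphs), and the toy graph (two triangles joined by one edge) is a made-up example.

```python
from itertools import product


def modularity(edges, community):
    """Girvan-Newman modularity of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    edges = list(edges)
    m = len(edges)
    degree = {node: 0 for node in community}
    adjacency = {}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacency[(u, v)] = adjacency.get((u, v), 0) + 1
        adjacency[(v, u)] = adjacency.get((v, u), 0) + 1
    total = 0.0
    for i, j in product(community, repeat=2):
        if community[i] == community[j]:
            # observed edge count minus the count expected with edges placed at random
            total += adjacency.get((i, j), 0) - degree[i] * degree[j] / (2 * m)
    return total / (2 * m)


if __name__ == "__main__":
    toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    print(modularity(toy_edges, two_groups))
```

Putting every node in a single group gives Q = 0, which is why a split is only accepted when it increases modularity.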


By relational transform, we turn the criterion into a linear function subject to linear constraints:

$$
\begin{aligned}
X_{ij} - X_{ji} &= 0 && \forall (i,j) && \text{(symmetry)} \\
X_{ii} &= 1 && \forall i && \text{(reflexivity)} \\
X_{ij} + X_{jk} - X_{ik} &\le 1 && \forall (i,j,k) && \text{(transitivity)}
\end{aligned}
$$

Idea: relying on the locally linear "Louvain" algorithm (Blondel-Guillaume) (Univ. Louvain / UPMC LIP6), use the linear relational form, which leads to O(N log N).

We can do more: thanks to the genericity of the Louvain algorithm, we can use linear criteria that are better than Girvan-Newman's, justified by optimal transport, e.g. "Deviation to Indetermination" (Patricia Conde-Céspedes).
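In the relational encoding, a partition is represented by a binary matrix X with X_ij = 1 when objects i and j share a cluster. The following is a minimal sketch, with hypothetical helper names, that builds X from cluster labels, checks the three linear constraints by brute force, and evaluates a criterion that is linear in X for an arbitrary weight matrix. The weights are a placeholder, not a specific criterion from the slides.

```python
from itertools import product


def relational_matrix(labels):
    """X[i][j] = 1 if objects i and j share a cluster label, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]


def is_equivalence_relation(x):
    """Check the three linear constraints: reflexivity, symmetry, transitivity."""
    indices = range(len(x))
    reflexive = all(x[i][i] == 1 for i in indices)
    symmetric = all(x[i][j] - x[j][i] == 0 for i, j in product(indices, repeat=2))
    transitive = all(
        x[i][j] + x[j][k] - x[i][k] <= 1
        for i, j, k in product(indices, repeat=3)
    )
    return reflexive and symmetric and transitive


def linear_criterion(weights, x):
    """Criterion linear in X: sum over (i, j) of weights[i][j] * x[i][j]."""
    n = len(x)
    return sum(weights[i][j] * x[i][j] for i in range(n) for j in range(n))


if __name__ == "__main__":
    x = relational_matrix(["a", "a", "b", "b", "b"])
    print(is_equivalence_relation(x))
```

Any binary matrix satisfying the three constraints encodes a valid partition, so the clustering problem becomes the maximization of a linear function over these constraints.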


Big Analytics: Some Topics of Interest

Big Analytics for Cyber-Security: components for attack detection and investigation (intelligent IDS from normalized log analytics, IS passive and dynamic mapping, log analytics, cyber intelligence)
  - Attack detection from relational and content data, intelligent IDS and sandbox coupling
  - Intelligent coupling with IS passive and dynamic mapping
  - Big Data platform for log analytics, visual analytics

Big Analytics for Smart Transport: Business Analytics web portal for understanding passenger behaviour and profiles, traffic anomaly detection
  - New components and use cases focused on mobility
  - Approach based on space-time queries, BI, an early-warning engine, Big Analytics and optimization techniques for the Smart City
  - Fraud detection

Big Analytics for National Security: Social Web Intelligence for national security
  - Cyber-infringement detection and investigation
  - SNA: social mining, crisis management
  - Maritime security: predictive analysis and anomaly detection
  - E-border: Big Analytics on passenger logs

Big Analytics for Maintenance: applications to vehicles, radar, weapon systems, transport…
  - HUMS (Health & Usage Monitoring Systems)


Big Analytics Innovation Trends at a Medium-Range Horizon

  - Coupling auto-encoder neural nets with predictive modeling for feature extraction
  - Opening "data stream processing" (real time) to more sophisticated and powerful analytical tools
    - Towards real-life CEP
  - Coupling "genetic algorithms" with "relational linear transforms"
    - Linearization procedures
  - In network analysis, addressing the complexity of dynamic graph modeling
