Thales Communications & Security
Big Data: Some Technical Challenges
Toward a Typology of Big Analytics Problems
J.F. Marcotorchino
The information contained in this document and any attachments are the property of THALES. You are hereby notified that any review, dissemination, distribution, copying or otherwise use of this document is strictly prohibited without Thales prior written approval. © THALES 2011. Template trtp version 7.0.8
BIG DATA/BIG ANALYTICS
Definitions
Big Data: all the technologies and techniques that help systems scale:
• Large (virtualized) file storage
• Distributed processing (Hadoop / MapReduce)
• NoSQL databases, for simple and complex queries
Big Analytics: techniques that are executed on a Big Data infrastructure and have the following properties:
• They adapt ad hoc techniques (statistics, learning) to this environment
• They scale linearly or quasi-linearly (on the order of O(N) or O(N log N)), or lend themselves to heavy parallelization
• Linearization is mandatory, either at the "criteria level" or at the "constraint polytope level"
• They use special types of learning techniques based on dimension reduction
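To make the MapReduce style of distributed processing mentioned above more concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to word counting. The function names are illustrative assumptions, not Hadoop APIs; on a real cluster the map and reduce calls would run on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word of one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    # Each document could be mapped on a different machine;
    # this sketch simply maps them one after the other.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


counts = word_count(["big data big analytics", "big analytics scale"])
```

Because each mapper only looks at its own document and each reducer only at its own key, both phases can be spread over many machines without coordination beyond the shuffle.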
The 4 V Challenge
Volume:
• Large storage capacity is now available
• NAS (Network Attached Storage)
• Virtualized storage
• Cloud computing
Velocity:
• Strong demand for immediate results
• Stream analytics for SEP/CEP (Stream & Complex Event Processing)
• In-memory computations adapted to key-value stores
Variety:
• Large diversity of heterogeneous data types
• Structured data (classical DB entries) or semi-structured data (images with added metadata)
• Unstructured data: text, speech, raw images, etc.
Value:
• The intrinsic value of the "data/information" pair is now recognized by businesses
Some Confusions to Avoid
Do not confuse:
• Combinatorial complexity vs indexing complexity
• The difficulty of IT computations vs the management of huge data volumes (HPC vs Big Data)
In the first case:
It is not the amount of data per se that is the obstacle, but the intrinsic combinatorial structure of the problem to solve.
Example: there are ≈ 10^29300 solutions (Berend-Tassa estimate, 2010) to explore when clustering a set of N = 10000 objects or individuals. Yet N = 10000 is not a huge amount.
In the second case:
It is the amount of data itself that poses a problem, through the structure of the indexing and storage architectures (difficulty due to scalability constraints).
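The order of magnitude quoted in the clustering example can be checked numerically. The sketch below assumes that the estimate referred to is the upper bound B_n < (0.792 n / ln(n + 1))^n on the Bell number B_n (the number of partitions of n objects); that reading is an assumption, not something stated on the slide. It also computes small Bell numbers exactly to show that the bound holds there.

```python
import math


def bell_number(n):
    """Exact Bell number B_n (partitions of n objects) via the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]          # each row starts with the previous row's last entry
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


def log10_bell_upper_bound(n):
    """log10 of the assumed bound (0.792 * n / ln(n + 1)) ** n."""
    return n * math.log10(0.792 * n / math.log(n + 1))


small_exact = bell_number(10)                 # exact count for 10 objects
exponent = log10_bell_upper_bound(10000)      # power of ten for N = 10000
```

Under this assumption the exponent for N = 10000 lands between 29300 and 29400, consistent with the figure on the slide, while the input itself fits in a few kilobytes: the difficulty is combinatorial, not a matter of volume.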
How to address Scalability Problems
Scalability by « Linearization » vs scalability by « Parallelization »
In the first mode:
If the computing time needed for a population of N objects is T, then with a linear algorithm the computing time will be ≈ αT when the population size grows from N to αN.
In the second mode:
If an algorithm dedicated to a population of size N can be processed on a SINGLE machine within a time T, then when the population scales up to αN (α integer), the computations can be distributed over α machines to keep a computing time equal to T.
Combining both modes is the best possible approach (where applicable).
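The two modes above can be written as a tiny cost model. This is only an idealized sketch: the function names are hypothetical, and it assumes the workload splits into independent blocks of size N and that communication between machines costs nothing.

```python
import math


def linear_time(base_time, alpha):
    """First mode: a linear algorithm on alpha * N objects takes about alpha * T."""
    return alpha * base_time


def parallel_time(base_time, alpha, machines):
    """Second mode: alpha blocks of size N spread over `machines` machines.

    Each machine handles ceil(alpha / machines) blocks, one after the other,
    at a cost of T per block. Communication overhead is ignored (assumption).
    """
    return math.ceil(alpha / machines) * base_time


T = 10.0                                    # time for N objects on one machine
single_machine = linear_time(T, alpha=8)    # 8 times more data, one machine
full_cluster = parallel_time(T, alpha=8, machines=8)
half_cluster = parallel_time(T, alpha=8, machines=4)
```

With α = 8, one machine needs 8T, eight machines keep the time at T, and four machines need 2T, which is why combining a linear algorithm with distribution is attractive.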
An Operational Characterization of Big Analytics Methods
Big Data Analytics: « Extended » vs « Intrinsic » cases
« Extended » Case:
• Possible use of NoSQL storage architectures, or NewSQL ones
• Exhaustive analysis of the whole data set is not mandatory at all
• « Analytic Sampling » or « Big Sampling » is sufficient in most cases, e.g. customer segmentation, CRM, cross-selling, churn & attrition analysis, intrusion analysis or HUMS (Health & Usage Monitoring Systems)
• The rest of the population, outside the « samples », is processed by …
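As an illustration of how a sample can be drawn from a data set too large to hold in memory, here is a sketch of reservoir sampling (Algorithm R). This is a standard technique chosen here for illustration; the slides do not say which sampling scheme « Analytic Sampling » or « Big Sampling » actually relies on.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Draw k items uniformly at random from a stream in a single pass."""
    rng = random.Random(seed)        # seeded only to make the sketch reproducible
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            sample.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, index)    # inclusive bounds
            if j < k:
                sample[j] = item     # replace with probability k / (index + 1)
    return sample


sample = reservoir_sample(range(100000), k=100)
```

The stream is read once and only k items are kept at any time, so the memory footprint does not depend on the size of the population.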
« Intrinsic » Case:
• It is mandatory to rely on the full data set (exhaustivity), even though finding ways to avoid this remains a research topic
• No a priori knowledge, or only partial knowledge, of the population structure
• Data are stored in NoSQL architectures using the appropriate formats (e.g. for graph DBs: Neo4j, or FlockDB, an open-source, distributed, fault-tolerant graph database for managing data at scale, developed and used by Twitter)
• To manage the exhaustivity constraint, one must use heuristics or meta-heuristics based on linear iterations, or parallelization through distributed computations
Some NoSQL DB Types
• Key-Value Stores
• Column-Oriented DB
• Document-Oriented DB
• Graph Databases
Examples named on the slide: BigTable, (Facebook), Infinity DB, (Amazon) DynamoDB, Neo4j
Complexity grows like E Rel, where E = number of entities and Rel = average number of relationships per entity
BIG DATA CONCEPTUAL FOUNDATIONS
[Brewer's CAP Theorem]
It is impossible to satisfy all 3 properties at once: choose 2.
• Consistency (C)
• Availability (A)
• Partition Tolerance (P)
[Figure: CAP triangle with the pairings AP, CA and CP. Systems placed on the diagram: MemcacheDB / Berkeley DB, Voldemort, CouchDB, HBase]
This document may not be reproduced, modified, adapted, published or translated in any way, in whole or in part, nor disclosed to a third party, without the prior written consent of Thales. © THALES 2012. All rights reserved. Template trtp version 7.1.0
Some ideas for solving Intrinsic Big Analytics problems
• Use mainly exhaustive methods (if possible, no statistical sampling) (Data Driven vs Hypothesis Driven)
• Affinity Analysis & Sequential Patterns (pure linear matchings, scalar products)
• Use classifiers with linear criteria
• Practice iterative queries: R²I², « Requêtage Récursif Itératif Intelligent » (Recursive, Iterative, Intelligent Querying), which applies two techniques in alternation: regularized similarity + clustering « on the fly »
• Unsupervised Clustering (no a priori) (extending « No K-Means » approaches using linear relational criteria)
• Text mining (word spotting)
• Reticular Data Analysis (social networks, huge IT networks)
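The remark that affinity analysis reduces to scalar products can be illustrated as follows: if each item is encoded as a binary vector over the transactions, the number of transactions containing two items together is the dot product of their vectors. This is a minimal sketch with made-up basket data and hypothetical helper names.

```python
from itertools import combinations


def item_vectors(transactions, items):
    """One binary indicator vector per item: 1 if the transaction contains it."""
    return {item: [1 if item in t else 0 for t in transactions] for item in items}


def dot(u, v):
    """Plain scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pair_supports(transactions):
    """Co-occurrence count of every item pair, computed as a scalar product."""
    items = sorted(set().union(*transactions))
    vectors = item_vectors(transactions, items)
    return {(a, b): dot(vectors[a], vectors[b]) for a, b in combinations(items, 2)}


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
supports = pair_supports(baskets)
```

Each scalar product is a single linear pass over the transactions, and the products for different pairs are independent, so the computation is both linear in the data size and easy to distribute.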
Tentative structuring of Big Analytics Approaches
[Figure: methods positioned along two axes, "Level of Problem Complexity" and "Lack of Population Knowledge", and grouped into four families: Reticular Data Structuring, Classical BI Data Mining, Learning & Neural Nets, Vector Matching Structuring]
Methods shown: Learning Model for Unsupervised Classification; Limited-Layer Neural Nets; Naïve Bayes Networks; Self-Encoded and Hourglass-Shaped Neural Nets; Image & Video Analytics; Sequential Pattern Recognition & Affinity Analysis; Parallel Coordinates; Unsupervised Clustering; Large Network Topological Design; Supervised Rule-Based Classification; Social Network Community Detection; Reticular Visual Analytics; Bi-Class SVM; Face & Pattern Recognition; Piecewise Linear Regression; Multi-Class SVM; MOLAP and XOLAP; MDL Learning
An Example of an Intrinsic Big Analytics Problem: Graph Modularity
Girvan-Newman's quadratic formulation
[Figure: Krebs' graph on American politics, with nodes labelled "Liberal", "Conservative" and "Centrist"]
S. Mandal (MIT)
The modularity of a network is "the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random" ("Deviation to Independence").
• Maximizing modularity rigorously may be NP-hard
MIT heuristic algorithm:
• Construct the modularity matrix and find its largest eigenvalue and the corresponding eigenvector
• Partition the network into two parts based on the signs of the elements of that leading eigenvector
• Repeat for each part
• If a proposed split does not cause modularity to increase, declare the subgraph indivisible and do not split it
• When the entire graph consists of indivisible subgraphs, stop
Typical running time: O(N² log N) for a sparse graph
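The first bisection step of the heuristic described above can be sketched in plain Python. This is an illustrative sketch only, not the MIT implementation: it handles a single split, uses a shifted power iteration where a production code would use a sparse eigensolver, and all function names are assumptions.

```python
def modularity_matrix(adjacency):
    """B_ij = A_ij - k_i * k_j / (2m) for an undirected graph; also returns 2m."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)
    matrix = [[adjacency[i][j] - degrees[i] * degrees[j] / two_m for j in range(n)]
              for i in range(n)]
    return matrix, two_m


def leading_eigenvector(matrix, iterations=500):
    """Power iteration on (matrix + shift * I).

    The shift (largest absolute row sum) makes every eigenvalue non-negative,
    so the iteration converges to the eigenvector of the largest signed eigenvalue.
    """
    n = len(matrix)
    shift = max(sum(abs(x) for x in row) for row in matrix)
    vector = [float(i + 1) for i in range(n)]   # arbitrary non-symmetric start
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) + shift * vector[i]
                   for i in range(n)]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    return vector


def bisect(adjacency):
    """Split nodes by the sign of the leading eigenvector; return signs and modularity."""
    b, two_m = modularity_matrix(adjacency)
    v = leading_eigenvector(b)
    s = [1 if x >= 0 else -1 for x in v]
    n = len(s)
    # Modularity of a two-way split: Q = s^T B s / (4m)
    q = sum(s[i] * b[i][j] * s[j] for i in range(n) for j in range(n)) / (2 * two_m)
    return s, q


# Two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3.
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
signs, q = bisect(two_triangles)
```

On this toy graph the split separates the two triangles. In the full heuristic, a split whose modularity gain is not positive would be rejected and the subgraph declared indivisible.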
By a relational transform, we turn the criterion into a linear function subject to linear constraints.
Idea: relying on the locally linear « Louvain » algorithm (Blondel-Guillaume) (Univ. Louvain / UPMC LIP6), use the Linear Relational Form: O(N log N)
We can do more: using the genericity of the Louvain algorithm, we can use better linear criteria than the Girvan-Newman one, based on Optimal Transport justifications, e.g. « Deviation to Indetermination » (Patricia Conde-Céspedes)
$X_{ij} - X_{ji} = 0 \quad \forall (i,j)$ (Symmetry)
$X_{ii} = 1 \quad \forall i$ (Reflexivity)
$X_{ij} + X_{jk} - X_{ik} \le 1 \quad \forall (i,j,k)$ (Transitivity)
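The three constraints above say that the binary matrix X encodes an equivalence relation, that is, a partition of the objects. The sketch below builds X from cluster labels, checks the constraints by brute force, and evaluates the modularity criterion written as a linear function of X. The formula used, (1/2m) Σ (A_ij − k_i k_j / 2m) X_ij, is the standard modularity expression and is assumed here to be the criterion meant on the slide; the function names are hypothetical.

```python
from itertools import product


def relation_from_labels(labels):
    """X_ij = 1 if objects i and j carry the same cluster label, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


def is_equivalence_relation(x):
    """Brute-force check of the symmetry, reflexivity and transitivity constraints."""
    n = len(x)
    symmetric = all(x[i][j] - x[j][i] == 0 for i, j in product(range(n), repeat=2))
    reflexive = all(x[i][i] == 1 for i in range(n))
    transitive = all(x[i][j] + x[j][k] - x[i][k] <= 1
                     for i, j, k in product(range(n), repeat=3))
    return symmetric and reflexive and transitive


def linear_modularity(adjacency, x):
    """Modularity as a linear function of X."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)
    return sum((adjacency[i][j] - degrees[i] * degrees[j] / two_m) * x[i][j]
               for i in range(n) for j in range(n)) / two_m


# Two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3.
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
x = relation_from_labels([0, 0, 0, 1, 1, 1])
valid = is_equivalence_relation(x)
q = linear_modularity(graph, x)

# A matrix that links 0-1 and 1-2 but not 0-2 violates transitivity.
broken = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
broken_valid = is_equivalence_relation(broken)
```

The brute-force transitivity check is cubic and only meant to make the constraints explicit; a scalable method would never enumerate all triples, which is the point of relying on locally linear heuristics.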
Big Analytics: Some Topics of Interest
Big Analytics for Cyber-Security:
Components for attack detection and investigation (intelligent IDS from normalized log analytics, IS passive and dynamic mapping, logs analytics, cyber intelligence)
• Attack detection from relational & content data, intelligent IDS and sandbox coupling
• Intelligent coupling with IS passive and dynamic mapping
• Big Data platform for logs analytics, visual analytics
Big Analytics for Smart Transport:
Business Analytics Web portal for passenger behaviour and profile understanding, traffic anomaly detection:
• New components and use cases focused on mobility
• Approach based on space-time queries, BI, early warning engine, Big Analytics and optimization techniques for Smart City
• Fraud detection
Big Analytics for National Security:
Social Web Intelligence for National Security:
• Cyber-infringement detection and investigation
• SNA: social mining, crisis management
• Maritime security: predictive analysis & anomaly detection
• E-border: Big Analytics on passenger logs
Big Analytics for maintenance:
Applications to vehicle, radar, weapon systems, transport…
• HUMS (Health & Usage Monitoring Systems)