12/02/2015
Big Data & Security
Welcome to Madrid !!!
▶ Big Data AND security: what is on our minds?
▶ Big Data tools and technologies
▶ Big Data T&T chain and security/privacy concern mappings
▶ From strategies to concrete solutions
▶ Future research topics
▶ SECCORD and PHEME
BD 4 SEC; SEC 4 BD or BD & SEC?
Big Data for
Plethora of “Big Data” related tools
Technology landscape
Big Data Baseline technologies
• Batch processing: Apache Hadoop, Spark...
• Real-time processing: Apache Storm, S4, Spark
• Messaging & queues: Apache Flume, Kafka
• Visualization frameworks: HTML5, D3, Tableau, QlikView, 3D data visualization, mobile visualization
• Related technologies: CEP, pipelines, sensor and machine acquisition APIs, social network APIs
Advanced Data Analytics
• Machine Learning, Deep Learning, Data mining, Web mining
• Statistical methods, pattern recognition
• Decision Support Systems, predictive & prescriptive analytics
• Libraries: MADlib, R libraries; Languages: R, Python, Java; Analytical processes: KNIME, RapidMiner; Statistical software: R, Weka, Mahout, Cluto, Octave, gretl, Racket, SPSS; Methodology: CRISP-DM; Standards: Predictive Model Markup Language (PMML)
Language technologies
• Natural Language Processing
• Named Entity Recognition, PoS tagging, language detection
• (Semi)automatic categorization and annotation
Big Data storage
• NoSQL: HBase, Cassandra, MongoDB, Neo4j
• Triplestores: Sesame, GraphDB
• NewSQL
• In-memory processing (SAP HANA...)
Semantics
• Ontology engineering
• Linked Data
• Formal Semantics (DL, OWL, FOL)
• Semantic Interoperability
Big Data architectures
• Big Data reference architectures
• Lambda architecture
• Scalable solutions, fit-for-purpose solutions
• Standards
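The Lambda architecture named on this slide can be illustrated with a toy event counter. This is only a minimal sketch of the general idea (a batch view recomputed from an append-only master dataset, a speed view covering recent events, and a query that merges both); the class and method names are hypothetical and are not tied to Hadoop, Storm or Spark.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda-style counter (hypothetical names, illustration only)."""

    def __init__(self):
        self.master_log = []         # append-only master dataset
        self.batch_view = Counter()  # recomputed from the whole master dataset
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, key):
        # New events go to the master dataset and to the speed layer.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Batch layer: recompute the view from scratch, then reset the speed view.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        # Serving layer: merge the batch view with the speed view.
        return self.batch_view[key] + self.speed_view[key]
```

In a real deployment the batch recomputation and the speed layer would run on separate systems; here both live in one object only to show how the two views combine at query time.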
Security and privacy technologies
in the Big Data Value Chain
[Diagram: Big Data value chain]
Stages shown: Data Storage, Data Acquisition, Data Usage, Data Analysis, Data pre-processing, Data Curation
Other blocks: Big Data Architecture, Data science
Diagram labels, in extraction order:
• NO, HDFS, HBase, Cassandra, MongoDB, ElephantDB, Neo4j, Triplestores, …
• Models, Veracity, Matching, Cleansing, Validation, Update
• Social Networks, IoT, Web, …
• CEP, Messaging, Pub-sub, Apache Kafka, Apache Flume, …
• Visualization: 2D, 3D, Mobile, APIs, D3, Tableau
• Text mining, text analysis, sentiment analysis, case-based reasoning, real-time data analytics, data mining, machine learning, deep learning, …
• Hadoop, Storm, Spark…
• Data scientist, statistics, data mining, machine learning
• Reference, Lambda Arch., Pipelines, …
• Filtering, Cleansing, Aggregation, Fusion
• Annotation, Categorization, NLP, NER
• R, Octave, ML frameworks, Weka, Apache Mahout
Open security/privacy questions: Consent in M2M? Anonymization? Access and usage policies?
Strategy analysis
▶ MINIMIZE (Collection stage):
– Data posted on SN, collected as a service requirement, collected as a legal requirement, collected automatically and unknowingly (e.g. location), inferred by previous processing, bought and added from external sources, shared with external sources
– Recommendation: move from consent to reputational penalties (trust index)
▶ HIDE (Pre-processing):
– Side information, metadata leakage (e.g. location)
– Anonymization/de-identification is not feasible in the long term
– Adding “noise”, using an intermediary (trusted privacy proxy), publishing “epsilons”
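One common reading of adding “noise” and publishing “epsilons” is differential privacy, where epsilon is the published privacy budget. The sketch below assumes that reading and applies the Laplace mechanism to a counting query; the function name and parameters are hypothetical.

```python
import random


def noisy_count(records, predicate, epsilon, rng=None):
    """Return a count perturbed with Laplace noise of scale 1/epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one record changes a count by at most 1
    # (sensitivity 1), so the Laplace scale is 1/epsilon.
    # The difference of two i.i.d. exponentials with rate epsilon
    # is Laplace-distributed with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise
```

A smaller epsilon gives more noise and stronger privacy; publishing the epsilon lets data consumers judge how much accuracy was traded away.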
▶ HIDE (Processing):
– Functions over encrypted data
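As an illustration of “functions over encrypted data”, the following toy Paillier sketch shows additive homomorphism: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The slides do not name a specific scheme, so Paillier is an assumption here, and the tiny primes are for demonstration only and offer no security.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier key generation from two primes (assumes g = n + 1)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse, Python 3.8+
    return n, (lam, mu)


def encrypt(n, m, rng=random):
    """Encrypt 0 <= m < n as (1 + n)^m * r^n mod n^2 for a random unit r."""
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    n2 = n * n
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    return (c1 * c2) % (n * n)


n, priv = keygen(101, 113)          # demo-sized primes only
c_sum = add_encrypted(n, encrypt(n, 15), encrypt(n, 27))
total = decrypt(n, priv, c_sum)     # 42, computed without decrypting the inputs
```

The party performing `add_encrypted` needs only the public modulus `n`, which is the point of the HIDE (Processing) strategy: the processor never sees the plaintext values.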
Strategy analysis
▶ CONTROL (Data usage and analysis)
– Express purpose, context, usage… in a data policy
– Associate the policy with the data (“sticky” policy) and with the processing component (monitoring and enforcement)
– From NL policy to MR policy and data tags (transformation, refinement)
– From input policies to (computed) output policies
– Build an EU regulation library of NL2MR patterns
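The CONTROL items above (a policy that travels with the data, enforcement in the processing component, and output policies computed from input policies) can be sketched as follows. All names are hypothetical, and the rule that an output inherits the most restrictive combination of its input policies is an assumption made for illustration; the slides do not specify how output policies are computed.

```python
from dataclasses import dataclass


class PolicyViolation(Exception):
    pass


@dataclass(frozen=True)
class Policy:
    purposes: frozenset   # purposes the data may be used for
    retention_days: int   # how long the data may be kept

    def combine(self, other):
        # Assumed rule: the output is at least as restricted as every input.
        return Policy(self.purposes & other.purposes,
                      min(self.retention_days, other.retention_days))


@dataclass(frozen=True)
class Tagged:
    value: object
    policy: Policy        # the "sticky" policy attached to the value


def process(inputs, purpose, fn):
    """Enforce input policies, run fn, and attach a computed output policy."""
    for item in inputs:
        if purpose not in item.policy.purposes:
            raise PolicyViolation(f"purpose {purpose!r} not allowed")
    out_policy = inputs[0].policy
    for item in inputs[1:]:
        out_policy = out_policy.combine(item.policy)
    return Tagged(fn([item.value for item in inputs]), out_policy)
```

For example, summing one value usable for billing and research with another usable only for research yields a result that is itself usable only for research.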
▶ BD results and post-use impact
– Discrimination
– Data divide
– Power imbalance
– Echo chambers
Need to define FUTURE security research topics for BD
▶ Secure data conditioning
▶ Tamper resistant logs (e.g. TR-Flume, privacy in auditing)
▶ Secure object storage
▶ Secure “divide and conquer” computation approach
▶ Secure stream processing
▶ SW and ontologies for policy conflict resolution, policy transformation and refinement
▶ Extracting and sharing cybersecurity linked data
▶ Security metadata and tagging (e.g. machine-readable certificates, use BD to semi-automate tagging)
▶ Security and machine learning, e.g. recommendation engines, prediction, intelligent agents, risk assessment, distributed ML, simulation games…
▶ Threats from unsupervised machine learning algorithms
▶ Secure infomediaries, data value-added resellers (VAR), data marketplaces (https://gnip.com)
▶ Statistical models, correlation rules, logic, etc.
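For the “tamper resistant logs” topic in the list above, one well-known building block is a keyed hash chain, in which every entry authenticates its predecessor. The sketch below is a generic illustration with hypothetical function names; it is not a description of Flume or of any existing product. Strictly speaking it makes a log tamper-evident: changes are detected, not prevented.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


def _mac(key, prev, message):
    payload = json.dumps({"prev": prev, "msg": message}, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def append_entry(log, message, key):
    """Append a message whose MAC also covers the previous entry's MAC."""
    prev = log[-1]["mac"] if log else GENESIS
    log.append({"prev": prev, "msg": message, "mac": _mac(key, prev, message)})


def verify(log, key):
    """Return True only if every entry is intact and correctly chained."""
    prev = GENESIS
    for entry in log:
        expected = _mac(key, prev, entry["msg"])
        if entry["prev"] != prev or not hmac.compare_digest(expected, entry["mac"]):
            return False
        prev = entry["mac"]
    return True
```

Editing or removing an entry in the middle breaks the chain. Truncating the tail is not detected by this sketch; detecting that would need an external anchor such as a periodically published latest MAC.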
SECCORD trend analysis
▶ SECCORD D5.4: Big Data impact on security
▶ Opening up security data repositories (ACDC)
▶ Role of unstructured text in SN-based botnet C&C
▶ Patterns of abnormal behavior
▶ Pattern recognition = discovery (data mining to find patterns) + detection (applying a pattern to find e.g. an anomaly)
▶ False alarm reduction
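The split of pattern recognition into discovery and detection can be illustrated with a simple statistical baseline. This is a minimal sketch that assumes a z-score model, which the slides do not prescribe; the function names and the threshold values are hypothetical.

```python
import statistics


def discover_baseline(history):
    """Discovery: mine a pattern (here, mean and standard deviation)."""
    return statistics.mean(history), statistics.stdev(history)


def detect(observations, baseline, threshold=3.0):
    """Detection: apply the pattern and flag values far from the baseline."""
    mean, std = baseline
    if std == 0:
        return [x for x in observations if x != mean]
    return [x for x in observations if abs(x - mean) / std > threshold]


history = [10, 12, 11, 9, 10, 11, 10, 12, 9, 11]  # e.g. logins per minute
baseline = discover_baseline(history)
strict = detect([10, 13, 40], baseline, threshold=3.0)   # flags only 40
lenient = detect([10, 13, 40], baseline, threshold=2.0)  # flags 13 and 40
```

Raising the threshold is the crudest form of false alarm reduction: the borderline value 13 is no longer reported, at the cost of possibly missing weaker anomalies.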
The 3+ V’s of Big Data
▶ 1. Volume (lots of data, zettabytes)
▶ 2. Variety (complexity, dimensionality)
▶ 3. Velocity (fast data)
+
▶ 4. Veracity (truthfulness, curation)
▶ 5. Venue (location)
▶ 6. Vocabulary (semantics)
▶ 7. Variability …
Conclusion no. 2 (not definitive): your privacy will be in the hands of the “data curator”
Thank you
Atos Research & Innovation
[email protected]