Big Data & Security. Aljosa Pasic 12/02/2015

(1)

12/02/2015

Big Data & Security

(2)

Welcome to Madrid!

• Big Data AND security: what is on our minds?
• Big Data tools and technologies
• Big Data T&T chain and security/privacy concern mappings
• From strategies to concrete solutions
• Future research topics
• SECCORD and PHEME

(3)

BD 4 SEC; SEC 4 BD or BD & SEC?

Big Data for

(4)
(5)

Plethora of “Big Data” related tools

(6)

Technology landscape

Big Data baseline technologies
• Batch processing: Apache Hadoop, Spark...
• Real-time processing: Apache Storm, S4, Spark
• Messaging & queues: Apache Flume, Kafka
• Visualization frameworks: HTML5, D3, Tableau, QlikView, 3D data visualization, mobile visualization
• Related technologies: CEP, pipelines, sensor and machine acquisition APIs, social network APIs

Advanced data analytics
• Machine learning, deep learning, data mining, Web mining
• Statistical methods, pattern recognition
• Decision support systems, predictive & prescriptive analytics
• Libraries: MADlib, R libraries; Languages: R, Python, Java; Analytical processes: KNIME, RapidMiner; Statistical software: R, Weka, Mahout, Cluto, Octave, gretl, Racket, SPSS; Methodology: CRISP-DM; Standards: Predictive Model Markup Language (PMML)

Language technologies
• Natural Language Processing
• Named Entity Recognition, PoS tagging, language detection
• (Semi)automatic categorization and annotation

Big Data storage
• NoSQL: HBase, Cassandra, MongoDB, Neo4j
• Triplestores: Sesame, GraphDB
• NewSQL
• In-memory processing (SAP HANA...)

Semantics
• Ontology engineering
• Linked Data
• Formal semantics (DL, OWL, FOL)
• Semantic interoperability

Big Data architectures
• Big Data reference architectures
• Lambda architecture
• Scalable solutions, fit-for-purpose solutions
• Standards

(7)

Security and privacy technologies in the Big Data Value Chain

[Diagram: the Big Data value chain, its technologies, and open security/privacy questions. Labels recovered from the diagram, grouped as they appear:]

• Value chain stages: Data Acquisition, Data pre-processing, Data Curation, Data Storage, Data Analysis, Data Usage
• Cross-cutting boxes: Big Data Architecture, Data science
• Data Storage: NO, HDFS, HBase, Cassandra, MongoDB, ElephantDB, Neo4j, triplestores
• Data Curation: models, veracity, matching, cleansing, validation, update
• Data Acquisition: social networks, IoT, Web, CEP, messaging, pub-sub, Apache Kafka, Apache Flume
• Data Usage: visualization, 2D, 3D, mobile, APIs, D3, Tableau
• Data Analysis: text mining, text analysis, sentiment analysis, case-based reasoning, real-time data analytics, data mining, machine learning, deep learning, Hadoop, Storm, Spark…
• Data science: data scientist, statistics, data mining, machine learning
• Big Data Architecture: reference, Lambda Arch., pipelines
• Data pre-processing: filtering, cleansing, aggregation, fusion, annotation, categorization, NLP, NER, R, Octave, ML frameworks, Weka, Apache Mahout
• Open security/privacy questions: Consent in M2M? Anonymization? Access and usage policies?

(8)
(9)

Strategy analysis

MINIMIZE (Collection stage):
• Data posted on social networks (SN), collected as a service requirement, collected as a legal requirement, collected automatically and unknowingly (e.g. location), inferred by previous processing, bought and added from external sources, shared with external sources
• Recommendation: move from consent to reputational penalties (trust index)

HIDE (Pre-processing):
• Side-information and metadata leakage (e.g. location)
• Anonymization/de-identification is not feasible in the long term
• Adding “noise”, using an intermediary (trusted privacy proxy), publishing “epsilons”
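A minimal sketch of the “noise” idea above, assuming that “epsilons” refers to the privacy budget of differential privacy (the slide does not say so explicitly). It perturbs a counting query with Laplace noise; the function names and the sample data are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng=None):
    """Return a count perturbed with Laplace noise.

    A counting query changes by at most 1 when one record is added or
    removed (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical data set: ages of 1,000 users.
    ages = [18 + (i * 7) % 60 for i in range(1000)]
    print(noisy_count(ages, lambda age: age >= 65, epsilon=0.5))
```

A smaller epsilon means more noise and stronger privacy; publishing the epsilon used lets data consumers judge how much accuracy was traded away.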

HIDE (Processing):
• Functions over encrypted data
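One way to picture “functions over encrypted data” is additively homomorphic encryption. The sketch below is a toy Paillier scheme chosen for illustration only (the slides name no specific scheme); the primes are far too small for real use and all function names are hypothetical. It needs Python 3.8+ for the modular inverse via `pow`.

```python
import math
import random


def keygen(p, q):
    """Build a toy Paillier key pair from two primes (insecure key size)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng):
    """Encrypt an integer m with 0 <= m < n under the public key n."""
    n_sq = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n, private_key, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiply ciphertexts; the result decrypts to the sum of the plaintexts."""
    return (c1 * c2) % (n * n)


if __name__ == "__main__":
    rng = random.Random()
    n, private_key = keygen(1009, 1013)
    total = add_encrypted(n, encrypt(n, 1200, rng), encrypt(n, 345, rng))
    print(decrypt(n, private_key, total))  # the processor never saw 1200 or 345
```

The party that calls `add_encrypted` only handles ciphertexts; only the key holder can read the sum.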

(10)

Strategy analysis

CONTROL (Data usage and analysis)
• Express purpose, context, usage... in a data policy
• Associate the policy with the data (“sticky” policy) and with the processing component (monitoring and enforcement)
• From natural-language (NL) policy to machine-readable (MR) policy and data tags (transformation, refinement)
• From input policies to (computed) output policies
• Build an EU regulation library of NL2MR patterns
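The sticky-policy and output-policy points above can be sketched as follows. This is an illustration under assumptions: the policy fields (`purposes`, `retention_days`) and the rule “the output policy is the most restrictive combination of the input policies” are hypothetical choices, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    purposes: frozenset   # purposes the data may be used for
    retention_days: int   # how long the data may be kept


@dataclass(frozen=True)
class TaggedData:
    value: object
    policy: Policy        # the "sticky" policy travels with the data


class PolicyViolation(Exception):
    pass


def combine(policies):
    """Compute an output policy as the most restrictive mix of the inputs."""
    policies = list(policies)
    if not policies:
        raise ValueError("at least one input policy is required")
    purposes = frozenset.intersection(*(p.purposes for p in policies))
    retention = min(p.retention_days for p in policies)
    return Policy(purposes, retention)


def process(inputs, purpose, func):
    """Enforce the declared purpose, run func, and tag the result."""
    for item in inputs:
        if purpose not in item.policy.purposes:
            raise PolicyViolation(f"purpose {purpose!r} not allowed")
    result = func([item.value for item in inputs])
    return TaggedData(result, combine(item.policy for item in inputs))


if __name__ == "__main__":
    a = TaggedData(10, Policy(frozenset({"billing", "research"}), 365))
    b = TaggedData(5, Policy(frozenset({"research"}), 30))
    print(process([a, b], "research", sum))
```

Here `process` plays the role of the monitoring and enforcement component: it refuses a use that any input policy forbids and attaches a computed policy to whatever it produces.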

BD results and post-use impact
• Discrimination
• Data divide
• Power imbalance
• Echo chambers

(11)

Need to define FUTURE security research topics for BD
• Secure data conditioning
• Tamper-resistant logs (e.g. TR-Flume, privacy in auditing)
• Secure object storage
• Secure “divide and conquer” computation approach
• Secure stream processing
• SW and ontologies for policy conflict resolution, policy transformation and refinement
• Extracting and sharing cybersecurity linked data
• Security metadata and tagging (e.g. machine-readable certificates, use BD to semi-automate tagging)
• Security and machine learning, e.g. recommendation engines, prediction, intelligent agents, risk assessment, distributed ML, simulation games…
• Threats from unsupervised machine learning algorithms
• Secure infomediaries, data value-added resellers (VAR), data marketplaces (https://gnip.com)
• Statistical models, correlation rules, logic etc.
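For the tamper-resistant logs topic above, a common building block is a hash chain, sketched here as an assumption about what such a log could look like (it is not a description of TR-Flume). Each entry commits to the previous one, so a later modification breaks verification. All names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash, event):
    """Hash an event together with the hash of the previous entry."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log, event):
    """Append an event, chaining it to the last entry in the log."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash,
                "hash": _digest(prev_hash, event)})


def verify(log):
    """Return True if no entry was altered, removed, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    append(log, {"user": "alice", "action": "read"})
    append(log, {"user": "bob", "action": "delete"})
    print(verify(log))                   # chain is intact
    log[0]["event"]["action"] = "none"   # simulate tampering
    print(verify(log))                   # tampering is detected
```

A chain like this only gives tamper evidence if the latest hash is also stored somewhere the attacker cannot rewrite; otherwise the whole chain can be recomputed.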

(12)

SECCORD trend analysis

SECCORD D5.4 : Big Data impact on security

Opening up security data repositories (ACDC)

Role of unstructured text in SN-based botnet C&C

Patterns of abnormal behavior

Pattern recognition = discovery (data mining to find patterns) + detection (apply pattern to find e.g. anomaly)

False alarm reduction
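The discovery + detection split above can be illustrated with a deliberately simple statistical baseline. This is a sketch under assumptions: the z-score rule, the threshold values, and the sample figures are hypothetical and stand in for whatever pattern-mining method is actually used.

```python
import statistics


def discover_baseline(history):
    """Discovery: learn the 'normal' pattern (mean and spread) from past data."""
    return statistics.mean(history), statistics.stdev(history)


def is_anomaly(value, baseline, threshold=3.0):
    """Detection: flag values more than `threshold` deviations from the mean."""
    mean, stdev = baseline
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


if __name__ == "__main__":
    # Hypothetical history: logins per hour on a normal day.
    history = [100, 102, 98, 101, 99, 103, 97, 100]
    baseline = discover_baseline(history)
    for observed in (101, 105, 120):
        print(observed, is_anomaly(observed, baseline))
```

Raising `threshold` is the crudest form of false alarm reduction: fewer borderline values are flagged, at the cost of missing weaker anomalies.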

(13)

The 3+ V’s of Big Data

1. Volume (lots of data, zettabytes)

2. Variety (complexity, dimensionality)

3. Velocity (fast data)

+

4. Veracity (truthfulness, curation)

5. Venue (location)

6. Vocabulary (semantics)

7. Variability …

(14)
(15)

Conclusion nr. 2 (not definitive): your privacy will be in the hands of the “data curator”

(16)

12/02/2015

Thank you

Atos Research & Innovation

Atos, the Atos logo, Atos Consulting, Atos Worldline, Atos Sphere, Atos Cloud and Atos WorldGrid are registered trademarks of Atos SA. June 2011

© 2011 Atos. Confidential information owned by Atos, to be used by the recipient only. This document, or any part of it, may not be reproduced, copied, circulated and/or distributed nor quoted without prior written approval from Atos.
