Machine Learning for Cyber Security Intelligence

27th FIRST Conference 17 June 2015


Introduction | whois

Edwin Tump

10 yrs at NCSC.NL (GOVCERT.NL)

- 9 yrs as a security specialist
- 1 yr as a security analyst

Areas of special interest

- Information collection (e.g. Taranis)
- Tooling

Agenda

- Current situation

- Challenges

- Desired situation

- Approach


Current situation | Taranis

Taranis: n sources, n x 4 items

1,500 sources

- Day: 6,000 items
- Week: 42,000 items
- Month: 180,000 items

Current situation | Taranis keyword matching

Automatic dispatch of irrelevant items

- Only for specific sources
- Relevant items determined based on keyword lists

Threats

- Cross-Site Scripting
- SQL-injection
- Vulnerability
- Defacement
- Data dump

Constituents

- Ministry of Security and Justice
- Ministry of …
- Power Company X
- Bank Y
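The keyword matching described above could look roughly like the following minimal sketch. The keyword lists are copied from the slide; the function names and the rule "relevant if any keyword on either list matches" are assumptions for illustration, not the actual Taranis implementation.

```python
import re

# Keyword lists modelled on the slide's two categories (illustrative only).
THREAT_KEYWORDS = ["cross-site scripting", "sql-injection", "vulnerability",
                   "defacement", "data dump"]
CONSTITUENT_KEYWORDS = ["ministry of security and justice", "power company x", "bank y"]


def matched_keywords(text, keywords):
    """Return the keywords that occur in text as whole phrases, ignoring case."""
    lowered = text.lower()
    hits = []
    for keyword in keywords:
        # Lookarounds prevent matches inside longer words (e.g. "riverbank yard").
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, lowered):
            hits.append(keyword)
    return hits


def is_relevant(text):
    """Assumed rule: an item is relevant if any keyword on either list matches."""
    return bool(matched_keywords(text, THREAT_KEYWORDS)
                or matched_keywords(text, CONSTITUENT_KEYWORDS))
```

Items for which `is_relevant` returns `False` would be the candidates for automatic dispatch.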

Current situation | Taranis deduplication


Current situation | Taranis 5 phases


Current situation | Outputs

Taranis

- End-of-Shift
- End-of-Week
- Internal analysis
- External analysis
- Monthly monitor
- Factsheet
- Dossier
- Security advisory

Challenges

Image courtesy of Stuart Miles at FreeDigitalPhotos.net

Handling a growing amount of information is labor-intensive

- Determine relevant topics
- Determine new trends
- Rate the reliability of news: the good, the bad and the ugly
- Keep dossiers up-to-date

Higher expectations; reports must be …

- … timely
- … complete
- … short

Desired situation


… cluster items and detect stories

… determine story relevance

… assign items to dossiers

Approach


Agile-scrum

- Close contact between participants
- Multiple short sprints (4 sprints of 2 weeks)

Principles

- High level of trust
- Use of well-known concepts in the field of machine learning
- Open mind, not restricted by a ready-made product

Deliverables

- Better understanding of the usefulness of machine learning techniques
- Proof-of-Concept(s)
- Detailed requirements for future work
- Paper describing the results

Data (Taranis)

Expertise

- Source data
- Process
- Products

Tools

Expertise

- Text mining
- Machine learning
- Information retrieval

Machine learning (ML)

Main methods

- Supervised learning
  o Classification
  o Regression
- Unsupervised learning
  o Clustering
  o Association
  o Density estimation
  o Dimensionality reduction

Proof-of-Concept toolset

PostgreSQL

- Relational database
- Offline partial copy of the Taranis database
- http://www.postgresql.org/

R

- A “free software environment for statistical computing and graphics”
- http://www.R-project.org

Tableau

- Data visualization / data analysis
- http://www.tableau.com/

FoamTree

Overview

- Taranis
- Language detection
- Preprocessing and representation
- Story detection and linking
- LDA Topics

Language detection

[Figure: language detection results in two panels, "With regular news" and "Without regular news"; values shown: 53%, 38%, 76%, 13%]

Preprocessing and representation

Remove noisy words (stop words)

Remove duplicates

Remove numbers

Remove punctuation marks

Normalize case (upper to lower)

Stem words
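The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, `crude_stem` is a simplistic suffix stripper standing in for a real stemmer (it will not reproduce the stems of the tooling used in the talk), and "remove duplicates" is interpreted here as dropping documents that are identical after preprocessing.

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "at", "by", "from", "in", "of", "the", "to"}


def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Apply the listed steps to one document and return its tokens."""
    text = text.lower()                    # normalize case
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation marks
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # remove stop words
    return [crude_stem(t) for t in tokens]  # stem words


def deduplicate(documents):
    """Keep the first of any documents whose preprocessed tokens are identical."""
    seen, unique = set(), []
    for doc in documents:
        key = tuple(preprocess(doc))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```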


Example

Exploit Kit Delivers DNS Changer to Thousands of Routers

A malicious campaign deployed by cybercriminals aims at changing the Domain Name System (DNS) server settings in router configuration, responsible for retrieving the correct web pages from legitimate web servers. An attacker changing these settings can point to malicious locations, exposing the victim to a wide range of risks varying from credential stealing and ad-fraud to traffic interception and malware delivery.

exploit deliv changer thousand router malicy campaign deploy cybercrimin chang domain server router configur respons retriev correct page legitim server attacker chang point malicy locat expos victim wide rang risk vary credenty steal fraud traffic intercept

k Means

Developed by Stuart Lloyd in 1957, but not published until 1982

Unsupervised

Widely used because of its simplicity and performance

Words represented as vectors

- Vectors calculated based on TF-IDF

Steps:

- Assignment: choose k cluster centers and then assign each point to the closest cluster center
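As a minimal sketch of k-means on TF-IDF vectors, the following pure-Python example may help. It is not the code used in the Proof-of-Concept (which was built around R). The TF-IDF weighting variant (`count * (log(N / df) + 1)`), the choice of the first k documents as initial centers, and all function names are assumptions made for illustration. The update step shown is the standard second half of Lloyd's algorithm.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Turn tokenized documents into dense TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[t] * (math.log(n_docs / doc_freq[t]) + 1.0) for t in vocab])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, max_iter=100):
    """Lloyd's algorithm; the first k vectors serve as initial centers (a simplification)."""
    centers = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        new_labels = [min(range(k), key=lambda c: squared_distance(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move either
        labels = new_labels
        # Update step: each center moves to the mean of its assigned points.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

In practice the initial centers are usually chosen at random or with a seeding heuristic, and the result depends on that choice.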

Story Detection and Relevance

Steps

1. Deduplication
2. Clustering in (overlapping) sliding windows
   - Windows of 12 hrs
   - Slices/slicing of 2 hrs
3. Clustering between time slices
4. Determine story relevance, based on
   - Volume of documents within the story
   - Source reputation (derived from mail actions on items from source)
   - Cross-category stories

Important parameters

- ‘within story’ (0…1)

[Figure labels: 12 hrs, 2 hrs, cluster, story]
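The overlapping sliding windows in step 2 can be sketched as follows. This assumes that "windows of 12 hrs" with "slices of 2 hrs" means a 12-hour window whose start advances in 2-hour steps; the function names and the `(timestamp, item_id)` item format are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=12)  # window length from the slide
SLICE = timedelta(hours=2)    # assumed step between consecutive window starts


def sliding_windows(start, end, window=WINDOW, step=SLICE):
    """Yield (window_start, window_end) pairs for every window starting in [start, end)."""
    current = start
    while current < end:
        yield current, current + window
        current += step


def items_per_window(items, start, end):
    """Group (timestamp, item_id) pairs by each half-open window that contains them."""
    grouped = []
    for w_start, w_end in sliding_windows(start, end):
        members = [item_id for ts, item_id in items if w_start <= ts < w_end]
        grouped.append((w_start, members))
    return grouped
```

Because consecutive windows overlap by 10 hours under this reading, one item falls into several windows, which is what allows clusters from adjacent windows to be linked into a story in step 3.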

Project focus

Story Relevance | Taranis 5 phases

- Collect
- Assess
- Analyze
- Write
- Publish

- collected only
- collected and forwarded
- collected and analyzed
- collected and used in publication

Support Vector Machines (SVM)

Supervised classification

SVMs take your data and draw a hyperplane to divide your dataset into groups of positive and negative observations

Types of classification:

- Binary classification (simplest)
- Multi-category classification
  o One-against-one
  o One-against-rest

Cyber Security Assessment Report Netherlands 2013
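To make the hyperplane idea concrete, here is a minimal sketch of a binary linear SVM trained by subgradient descent on the regularized hinge loss. It is an illustration only: the talk does not say which SVM implementation or kernel was used, and the hyperparameters (`epochs`, `learning_rate`, `reg`) are arbitrary assumptions.

```python
def train_linear_svm(points, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Fit weights w and bias b; labels must be +1 (positive) or -1 (negative)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization shrinks w a little on every step.
            w = [wi - learning_rate * reg * wi for wi in w]
            if margin < 1:
                # Point is misclassified or inside the margin: move the hyperplane.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

A one-against-rest multi-category classifier can be built from this by training one such binary model per category and picking the category whose model gives the highest score.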

Latent Dirichlet Allocation

Unsupervised

Assumption: documents are generated by latent topics, the hidden variables from which the observed documents arise

Main motivation: uncover the underlying semantic structures (= hidden topics) of a document

Each document is a mix of topics, each defined by a collection of words

Variables:

- K = number of topics
- W = word
- D = documents
- Z = topic

Continually answer the following questions for each word:

- How often does "W" appear in topic "Z" elsewhere?
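The "continually answer for each word" loop reads like collapsed Gibbs sampling, one common way to fit LDA. The sketch below assumes that interpretation. The first factor in the sampling weight corresponds to the question on the slide. The second factor (how common topic Z is in the rest of the same document) is the standard companion term in collapsed Gibbs sampling and is not taken from the slide. The hyperparameters `alpha` and `beta`, the function name, and the word-id input format are likewise assumptions.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # count of topic Z in document D
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # count of word W in topic Z
    topic_total = [0] * num_topics                              # total words in topic Z
    assignments = []
    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove this occurrence so the counts describe "elsewhere".
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight = (how often W appears in topic Z elsewhere)
                #        * (how common topic Z is in the rest of this document).
                weights = [
                    (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    * (doc_topic[d][k] + alpha)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word
```

Here `num_topics` plays the role of K, and each row of `doc_topic` gives the topic mix of one document as raw counts.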

Conclusions


Overall

- Promising
- We already trained our system (without knowing we did…)
- Interactive approach leads to quick development of knowledge

Proof-of-Concept

- Stories found are understandable and relevant
- Linking items to existing dossiers is successful
- Determining new dossiers is difficult

Thanks for your attention!
