• No results found

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

N/A
N/A
Protected

Academic year: 2021

Share "Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013"

Copied!
20
0
0

Loading.... (view fulltext now)

Full text

(1)

Complexity and Scalability in Semantic Graph Analysis

Semantic Days 2013 James Maltby, Ph.D

1

(2)

Outline of Presentation

• Semantic Graph Analytics

• Database Architectures

• In-memory Semantic Database Formulation

• Use Cases

• Conclusions and Challenges

2

(3)

Many Big Data Business Problems are based on Graphs

Intelligence/

Security Social

Networking

Telecom/

Mobile

Life Sciences/

Biology

Supply Chain

Healthcare/

Medicine

Internet/

www Finance

3

(4)

Graph Structured data presents unique challenges

• Graphs can be very difficult to partition

– Difficult for parallel distributed memory systems

• Accessing graphs requires irregular data access patterns

– Complicates data layout

• Data volumes are often very large

• Inserting new graph data is often required

4

(5)

Graph Analytics: Investigation and Discovery

• Discovery : The desired outcome of an investigative analytical process

• Often, this proceeds as follows – Start with an initial, possibly ill-

defined hypothesis (optional)

– Gather initial data, run queries to test hypothesis (or simply explore) – Based on results,

• Reformulate queries

• Add new data

• Look for patterns/connections spanning existing and new data

• Repeat/Iterate

• Contrast with

– Aggregative/Summary analytics (OLAP) – Computational analytics (simulations,

numeric/statistical techniques)

Difficult to predict what data is needed

Difficult to predict how much data is needed

Difficult to predict what

questions will be asked

(6)

Why is discovery hard?

Rigid Structure (Schema)

 Hard to add new sources

Relationships are ‘fixed and forgotten’

Scaling is hard

 Inability to easily and transparently add capacity

 Performance is often affected with increasing volumes

Performance is unpredictable

 Cannot pre-optimize for unknown questions

 Advanced analytics is difficult if performance is unpredictable

Today’s approaches

Difficult to predict what data is needed

Difficult to predict how much data is needed

Difficult to predict what

questions will be asked

(7)

Database Architectures

• Relational Database Management System (RDBMS)

– Most mature and widespread technology

– Typically optimized for Online Transaction Processing (OLTP)

• Rapid access to specific data items

• Many joins required

– Highly normalized using Third Normal Form (3NF) – Many tables, complex schemas

– Usually queried with SQL

• Data Warehouse

– Also a relational store

– Optimized for Online Analytic Processing (OLAP)

• More flexible queries across large areas of data

• Fewer joins

– Denormalized (redundancy added) for flexibility and performance – Some parallelism

– May have Extract, Transform and Load (ETL) functionality

7

(8)

Hadoop based approaches

• Very popular new storage paradigm

– Hbase, Cloudera and many others

• Highly parallel distributed system

– Horizontal scaling to very large systems – Large number of inexpensive nodes

• Often integrated with ETL

– “Bring the computation to the data”

• Searching for patterns often requires extensive programming

• Requires data to be partitionable across nodes

8

(9)

Semantic Graph Databases

• Designed for irregular data

– Graph based, not table based

• Ontologies are used instead of schemas

• Often defined in Resource Description Format (RDF)

– Other standards are Cyc, Cypher and Gremlin

• Often queried with SPARQL

• Joins are implicit, not explicit

– Simplifies query generation for data exploration

• Some Graph databases are Hadoop based

9

(10)

10

Import Graph Datasets

User/App Visualization

Export Analytic/

Relationship Results

Hadoop Other Big Data Appliance

Existing Analytic Environment(s

)

Data Warehouse

Programmatic Interfaces

Semantic Database

(11)

11

Differentiated Hardware Engine

 Originally designed for deep analysis of large datasets

Very large scalable shared memory

Architecture can support 512TB shared memory

Typical systems are 2 TB to 32 TB

 Multithreading

Unique highly multithreaded architecture

128 hardware threads per processor

Extreme parallelism, hides memory latency

Multithreaded Graph Database

Highly parallel in-memory RDF quad store

High performance inference engine

High performance parallel I/O

Industry Standard Front End

Based on Jena open-source semantic DB

All standard Linux infrastructure and languages

Lustre parallel file system

11

(12)

12

Graph partitioning is not required

 Large pool of uniform shared memory

High performance on irregular queries

 Highly parallel in-memory architecture

 Database architected for high degree of parallelism

New data may be added very rapidly

 High performance I/O

 No need to reconfigure schemas for new data types

Vertical scaling to very large systems

 Shared memory up to 512TB

 Parallelism up to 8192 processors x 128 threads

(13)

Cybersecurity Use Case: Discovering new threats

 Goal: Proactively identify unknown cyber threats by examining all relationships

 Data sets: IP, MAC, BGP, Firewall, DNS, Netflow, Whois, NVD, CIDR…

 Technical Challenges: Volume and Velocity of data; Temporal dependencies; Real-time

response

 Users: Cyber Analysts

 Usage model: Iterative analysis of all patterns across all traffic to explore deviations in

frequency of occurrence, derivative patterns of known threats and linking patterns through

relationships in offline data

 Augmenting: Existing data appliances

13

(14)

Relational Schema vs. Semantic Ontology

14

Cyber Security Data Set

(15)

15

The Challenge

Multiple massive datasets describing biological network graphs in cancer cells from published literature and experimental data, constantly updated

Un-partitionable, densely and irregularly connected graphs

Multiple researchers concurrently searching for relationships not found in published literature

Graph Analytic Solution

Urika holds un-partitionable genomic network graph in memory

Contrast experimental models and theories with published results to discover previously unknown relationships

Interactive, real-time access by multiple researchers

Business Value

Identify new pathways in cell models to repurpose existing medications

Confirmation of elevated VEGF levels by tissue microarray:

Discover new drugs and treatments

(16)

16

The Cancer Genome Atlas (TCGA) data combined with literature

relationships from Medline

 Protein domain interactions are extracted from TCGA using Random Forest classifier on Google Compute

 A Normalized Medline Distance (NMD) between proteins is calculated from

Medline titles and abstracts using a large cluster

 Public datasets from Uniprot and Pfam

5.3 Billion triples from heterogeneous data sets

 New data sets constantly being added- mutation data, patient data, methylation data, statistical analysis results and more…

Current database technology did not have the performance for useful full- scale queries across this data

GRAPH

Genes (TCGA)

Literature (Medline)

Protein Families

(PFAM) Protein Data

(UniProt) Mutation

Correlations (Statistical)

Patients (Health Records) Methylation

Query

(17)

17

Quickly adapt surveillance to new rules and patterns

The Challenge

Continuous stream of possible compliance events

New compliance rules require constant updates of signals and patterns

Current solutions require offline preparation and are based on rigid rules

Graph Analytic Solution

Entire relationship graph in memory

New Patterns/templates can be identified and added in real-time

Graphical interactive exploration of relationships between people, places, things, organizations, communications, etc.

Business Value

A responsive, flexible event detection platform that adapts to new knowledge

Figure Source: Graph-based technologies for intelligence analysis,T. Coffman, S. Greenblatt, S. Marcus, Commun. ACM, 47(3):45-47, 2004

(18)

18

Semantic Graph Analytics is a valuable analytical tool for discovery processes

Scalability and query complexity depend strongly on hardware and software architecture

Semantic Graph Analytics is complementary to other data analysis techniques

Interactive discovery is critical to many use cases

(19)

19

Develop (and agree upon!) ontologies for more subject areas

 Some fields have good coverage, some poor

Express more datasets in RDF format

 Synergistic effect

New graph visualization and querying techniques need to be developed

 Enable discovery by non-SPARQL users

(20)

20

References

Related documents

[r]

International interns may be required, at the discretion of the Assistant Dean - Graduate Programs, to take an additional course as preparation for the International

Site visit for initial assessment of damage Obtain damage estimate from repair shop Tasks and Subcases Police Report Physician Statement Interview transcripts Damage assessment

The thesis mainly aims to investigate the role of Asian Infrastructure Investment Bank in promoting economic growth in Asian developing countries. The long term relationship

As with this study, future studies might work towards identifying the types of scaffolding strategies used in the teaching of the math course both by practicing

When considering as well the errors associated to (i) the assignment of continuous tree diameters from discrete diameter classes (i.e. specific function 1 ), (ii) the simulation

For example, our Model 1 (OLS) results are similar to Immergluck and Smith (2006b): the marginal foreclosure impact is about 1% of the sales price when the foreclosure is within

Review of previous studies indicates they have been conflicting results and this study sought to determine the relationship of organizational structure and internal