Big Data Challenges to E-Discovery

(1)

©2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice

Big Data Challenges to E-Discovery

Paul Krneta, BMMsoft, Inc.
Perry J. Narancic, Esq., LexAnalytica, PC

(2)

Agenda

I. Computing Trends
II. Definition of “Big Data”
III. Legal and Regulatory aspects of Big Data
IV. Technology approaches to Big Data in E-Discovery
V. In the Trenches with Big Data:
A. EDRM Model and Big Data
B. Case Study: Yahoo v. OnlineNIC, No. 08-5698-JF (ND Cal. filed Dec. 19, 2008), Verizon v. OnlineNIC, No. 08-2832 (ND Cal. filed June 6, 2008), Microsoft v. OnlineNIC, No. 08-4648 (ND Cal. filed Oct. 7, 2008).

(3)

I. Computing Trends: the progress

1. CPU, disk, RAM, networks: 2x faster/bigger in 18 months
• 10x faster/bigger in 5 years, 100x in 10 years; 10,000x in 20 years
2. Price of HW and networks is dropping as fast as speed is growing
• Same $ buys you 10x faster/bigger system in 5 years, 100x in 10 years
3. Can your application SW take advantage of HW progress?
• Most SW is not designed with HW progress in mind
• 10-year old SW was designed for 100x slower HW
• 20-year old SW was designed for 10,000x slower HW
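The 10x, 100x and 10,000x multipliers follow from compounding a doubling every 18 months. A minimal sketch of that arithmetic, assuming steady exponential growth over the whole period (the function name `growth_factor` is illustrative and not from the slides):

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Capacity multiplier after `years`, given one doubling per `doubling_months`."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


for years in (5, 10, 20):
    # 5 years is 3.33 doublings, 10 years is 6.67, 20 years is 13.33
    print(f"{years:>2} years: ~{growth_factor(years):,.0f}x")
```

The computed factors are roughly 10x, 102x and 10,321x, so the slide's round numbers are order-of-magnitude figures rather than exact values.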

(4)

I. Computing Trends, cont’d

1. Structured Data:
• 80% of all corporate records (sales, phone records, payment data, stock trades) are structured and stored in relational databases, but due to their small size (typ. < 1 KB) they represent only 20% of enterprise data volume
• SQL is the main analytic tool for structured data
2. Unstructured data
• Emails, documents, multimedia represent over 80% of enterprise data volume (due to their large size of typ. 100 KB) – but less than 20% in terms of number of records
• Text search is the main tool to search unstructured data
• Archiving of unstructured data is typ. separate from Text search
3. Structured data = huge number of small records

(5)

II. Big Data = size, speed and diversity

Big Data happens when Moore’s law meets data growth
• Data volume doubles every 18 months
Big Data Challenges:
• Loading and indexing speed of Big Data has to be very high
• Search speed of Big Data must be very high
• Critical: SQL analysis and Text search must be unified
• Storage cost for Big Data can be astronomical – if not done right
Examples of Big Data: database records, emails, documents, SMS, video, audio, sensor data, social media and more
To discover new information and relationships arising from the cross-correlation of the entire data set, rather than isolated silos

 

(6)

II. Big Data, a marketing hype?

Of course, there is Big Data hype – just like there was Web hype, CRM hype etc.
But Big Data is nothing new – other than a name
Big Data is here to stay and burden us
– See e.g. Steve Lohr, “How Big Data Became so Big”, NYT, Aug. 11, 2012. http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html
– Bryant, Katz, Lazowska, “Big Data Computing: Creating revolutionary breakthroughs in commerce, science and society”, Computing

(7)

II. Big Data Applications

Long and diverse list of Big Data uses
Civil and Criminal Litigation – E-Discovery
Regulation (e.g. SEC, HIPAA)
Internal Policing and Dispute Avoidance (internal surveillance to catch problems, like harassment, fraud, etc.)
National Security (cross-correlating addresses, names, phone numbers)
Audit

(8)

III. Technology Approaches to Big Data

Multiple approaches to Big Data:
“Federation”: applications search multiple silos of data
– Relies on original data repository (“data silo”) to search data
– Slow, expensive, unreliable
In-place Indexing (similar to Google)
– Data sources are scanned and indexed to enable search
– Original data is not managed (= can be modified/deleted, no “HOLD”)
EDMT:
– EDMT stores Emails, Documents, Multimedia and DB Transactions for data compliance, retention and analysis
– EDMT runs Text+SQL cross-analysis of all data to allow eDiscovery, Audit, Fraud Detection, CRM, GRC, BI etc. to run up to 1,000x faster than before
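To make the idea of Text+SQL cross-analysis concrete, here is a minimal sketch that answers one question over structured records and message text in a single query. It uses Python's built-in `sqlite3` module with a plain `LIKE` substring match standing in for a real full-text index. The schema, sample rows and the function `large_trades_with_keyword` are hypothetical illustrations; this is not EDMT's implementation or API.

```python
import sqlite3

# In-memory database holding both kinds of data side by side (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trades (trade_id INTEGER PRIMARY KEY, trader_email TEXT,
                         symbol TEXT, amount REAL, trade_date TEXT);
    CREATE TABLE emails (email_id INTEGER PRIMARY KEY, sender TEXT,
                         sent_date TEXT, body TEXT);
""")
conn.executemany("INSERT INTO trades VALUES (?, ?, ?, ?, ?)", [
    (1, "alice@example.com", "ACME", 250000.0, "2012-03-01"),
    (2, "bob@example.com", "ACME", 1200.0, "2012-03-01"),
    (3, "alice@example.com", "INIT", 900.0, "2012-03-05"),
])
conn.executemany("INSERT INTO emails VALUES (?, ?, ?, ?)", [
    (10, "alice@example.com", "2012-03-01", "Keep this quiet until the ACME announcement."),
    (11, "bob@example.com", "2012-03-01", "Lunch on Friday?"),
    (12, "alice@example.com", "2012-03-05", "Quarterly report attached."),
])


def large_trades_with_keyword(conn, min_amount, keyword):
    """Return (trade_id, email_id) pairs where a trade of at least `min_amount`
    coincides with a same-day email from the trader containing `keyword`."""
    return conn.execute(
        """
        SELECT t.trade_id, e.email_id
        FROM trades AS t
        JOIN emails AS e
          ON e.sender = t.trader_email AND e.sent_date = t.trade_date
        WHERE t.amount >= ?                 -- structured (SQL) condition
          AND e.body LIKE '%' || ? || '%'   -- text condition (substring match)
        ORDER BY t.trade_id
        """,
        (min_amount, keyword),
    ).fetchall()


hits = large_trades_with_keyword(conn, 100000, "quiet")
print(hits)
```

With the sample rows above, only Alice's 250,000 trade on 2012-03-01 pairs with a same-day email containing "quiet". In a federated setup the same question would need a database query, a separate text search, and a manual merge of the two result sets.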

(9)

III. 1 PB of data in EDMT: 2007 vs. 2012

• 1 PB (1,030 TB) of data
• 6 Trillion records loaded + indexed
• Loading + indexing speed:
• 285 B records per day
• 35 TB/day
• Fresh data loaded in < 2 sec
• Search time: < 0.5 sec
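The stated totals and daily rates can be turned into per-record and per-second figures with simple division. The sketch below does only that arithmetic. It assumes decimal units (1 TB = 10^12 bytes) and treats the daily rates as sustained averages, neither of which the slide states; the names are illustrative.

```python
# Figures as stated on the slide.
TOTAL_TB = 1_030            # 1 PB, given as 1,030 TB
TOTAL_RECORDS = 6e12        # 6 trillion records
RECORDS_PER_DAY = 285e9     # 285 B records per day
TB_PER_DAY = 35

# Assumptions, not from the slide: decimal terabytes and a 24-hour loading day.
BYTES_PER_TB = 10**12
SECONDS_PER_DAY = 86_400


def implied_rates():
    """Derive per-record and per-second figures from the slide's totals."""
    return {
        "avg_bytes_per_record": TOTAL_TB * BYTES_PER_TB / TOTAL_RECORDS,
        "records_per_second": RECORDS_PER_DAY / SECONDS_PER_DAY,
        "megabytes_per_second": TB_PER_DAY * BYTES_PER_TB / SECONDS_PER_DAY / 10**6,
        "days_to_load_by_records": TOTAL_RECORDS / RECORDS_PER_DAY,
        "days_to_load_by_volume": TOTAL_TB / TB_PER_DAY,
    }


for name, value in implied_rates().items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the average record is roughly 172 bytes and the load rate is about 3.3 million records per second. The two load-time estimates (about 21 days by record count, about 29 days by volume) do not coincide, which suggests the two daily rates were not measured on the same workload.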

(10)

III. EDMT – smaller/bigger than 1 PB

EDMT® Solution Models and Specifications

     Model  Description   # cores  Disk Size  # emails & files (100 KB each)  DB rows
                                   (TB)       store + index   index only      (150-byte)
7    16K    [96 racks]    15,360   41,472     180 B           1,800 B         640 Trillion
6    4K     [24 racks]    3,840    10,368     48 B            480 B           160 Trillion
5    1K     [6 racks]     960      2,592      12 B            120 B           42 Trillion
4    4XL    [Full rack]   160      432        2 B             20 B            7 Trillion
3    PB     [1/2 rack]    80       288        1.6 B           16 B            6 Trillion
2    XL     [1/3 rack]    40       144        600 M           6 B             2 Trillion
1    L      [1/4 rack]    24       72         300 M           3 B             1 Trillion
     M      [2 RU]        12       36         150 M           1.5 B           500 B
     S      [2 RU]        6        36         150 M           1.5 B           500 B
     XS     [2 RU]        4        36         150 M           1.5 B           500 B

Tiers: Online, Mid, Entry

(11)

IV. In the Trenches with Big Data

Big Data Challenges in E-Discovery
Volume
Heterogeneity
Distributed

(12)

IV. In the Trenches with Big Data, cont’d

The EDRM Model and Big Data
Information Management: The problem of silo-searching. The need for speed, completeness, comprehensiveness.
Identification – the problems with the traditional “custodian” and “keyword” approaches. National Day Laborer Org. v. US, 2012 WL 2878130 (SDNY)
Preservation and Collection – problems with custodian-based preservation.

(13)

IV. In the Trenches with Big Data, cont’d

Case Study: Cross-Analysis of Structured and Unstructured Data in the OnlineNIC Trilogy
Facts: Yahoo, Microsoft and Verizon sued OnlineNIC, one of the largest domain name registrars in the world, for cybersquatting.
Alleged that defendant was using aliases to register infringing domains.
Defense used BMMsoft technology to collect and cross-correlate registration database and emails to show that the registrants were not affiliated with defendant
Yahoo and Verizon cases settled, Microsoft case was

 

(14)
(15)
(16)

Conclusion

1. Big Data is real.
2. Big Data poses unique issues for E-Discovery
3. Technology solutions need to be big, and unify structured and unstructured data into a

 

(17)

THANK YOU

Paul Krneta


(18)
(19)

EDMT extracts benefits from Big Data

EDMT advantages:
• Scalability & Performance
• Enterprise Ready
• Fast deployment, Easy maintenance
Use cases:
• FLAC (Fraud, Legal, Audit, Compliance)
• BI & DW: Better Customer Perspective
• IT: Operational Efficiency

[Diagram labels: EDMT; Big Data 2.0; BI (SQL); FLAC (Text)]
